Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-12 Thread Haoyuan Li
This link should be helpful:
https://alluxio.org/docs/1.7/en/Running-Spark-on-Alluxio.html
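
For context, a minimal sketch of the classpath wiring (the jar path below is
hypothetical): extraClassPath is read once, when the JVM launches, so it has
to reach the builder (or already be on the application's own classpath)
before getOrCreate; setting it on sparkSession.conf after the session exists
has no effect.

  import org.apache.spark.sql.SparkSession

  // Hypothetical location of the Alluxio client jar on the local machine.
  val ALLUXIO_SPARK_CLIENT = "/opt/alluxio/client/alluxio-client.jar"

  val spark = SparkSession.builder
    .appName("App")
    .master("local[*]")
    // Read at JVM/context start-up; too late once the session is running.
    .config("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
    .config("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
    .getOrCreate()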

Best regards,

Haoyuan (HY)

alluxio.com | alluxio.org | powered by Alluxio


On Thu, Apr 12, 2018 at 6:32 PM, jb44  wrote:

> I'm running spark in LOCAL mode and trying to get it to talk to alluxio.
> I'm
> getting the error: java.lang.ClassNotFoundException: Class
> alluxio.hadoop.FileSystem not found
> The cause of this error is apparently that Spark cannot find the alluxio
> client jar in its classpath.
>
> I have looked at the page here:
> https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
>
> Which details the steps to take in this situation, but I'm not finding
> success.
>
> According to the Spark documentation, I can instantiate a local Spark session like so:
>
> SparkSession.builder
>   .appName("App")
>   .getOrCreate
>
> Then I can add the alluxio client library like so:
> sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
> sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
>
> I have verified that the proper jar file exists in the right location on my
> local machine with:
> logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
> logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
>
> But I still get the error. Is there anything else I can do to figure out
> why
> Spark is not picking the library up?
>
> Please note I am not using spark-submit - I am aware of the methods for
> adding the client jar to a spark-submit job. My Spark instance is being
> created as local within my application and this is the use case I want to
> solve.
>
> As an FYI there is another application in the cluster which is connecting
> to
> my alluxio using the fs client and that all works fine. In that case,
> though, the fs client is being packaged as part of the application through
> standard sbt dependencies.


Re: Writing files to s3 without temporary directory

2017-11-22 Thread Haoyuan Li
This blog / tutorial may be helpful for running Spark in the cloud with
Alluxio.
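
One mitigation I have seen used, sketched below (it is standard Hadoop
configuration, not Alluxio-specific): commit algorithm v2 renames each task's
output directly to the final location, removing one round of copies under the
_temporary directory. Note it is not atomic on an eventually consistent store
like S3.

  // Task output is moved straight to the destination instead of being
  // staged under _temporary and moved again at job commit.
  sc.hadoopConfiguration.set(
    "mapreduce.fileoutputcommitter.algorithm.version", "2")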

Best regards,

Haoyuan

On Mon, Nov 20, 2017 at 2:12 PM, lucas.g...@gmail.com 
wrote:

> That sounds like a lot of work, and if I understand you correctly it sounds
> like a piece of middleware that already exists (I could be wrong). Alluxio?
>
> Good luck and let us know how it goes!
>
> Gary
>
> On 20 November 2017 at 14:10, Jim Carroll  wrote:
>
>> Thanks. In the meantime I might just write a custom file system that maps
>> writes of parquet file parts to their final locations and then skips the
>> move.


Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/
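
A minimal sketch of the pattern (the hostname, path, and the activeSessions
RDD name are all hypothetical): the first job parks the session RDD in
Tachyon's memory, and the job an hour later reads it back.

  // End of batch job 1: write the active sessions into Tachyon memory.
  activeSessions.saveAsTextFile("tachyon://tachyon-master:19998/sessions/latest")

  // Start of batch job 2, an hour later, possibly a brand-new SparkContext:
  val previousSessions = sc.textFile("tachyon://tachyon-master:19998/sessions/latest")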

Best,

Haoyuan

On Wed, Jul 22, 2015 at 10:56 AM, swetha  wrote:

> Hi,
>
> We have a requirement wherein we need to keep RDDs in memory between Spark
> batch jobs that run every hour. The idea here is to keep the RDDs holding
> active user sessions in memory between two jobs, so that once one job is
> done and another job runs an hour later, the RDDs with active sessions are
> still available for joining with those in the current job. So, what do we
> need in order to keep the data in memory between two batch jobs? Can we use
> Tachyon?
>
> Thanks,
> Swetha


-- 
Haoyuan Li
CEO, Tachyon Nexus <http://www.tachyonnexus.com/>


Re: How to stop making Multiple copies in memory when running multiple Spark jobs?

2015-07-05 Thread Haoyuan Li
You can also find more info here:
http://tachyon-project.org/master/Running-Spark-on-Tachyon.html
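
The gist of that page, as a rough sketch (hypothetical master address): every
job reads the same Tachyon-resident file, so the data is held once in
Tachyon's memory rather than copied into each job's JVM heap.

  // Run this in each concurrent job (separate SparkContexts/JVMs); they
  // all share the single in-memory copy held by the Tachyon worker.
  val shared = sc.textFile("tachyon://tachyon-master:19998/shared/input")
  val lineCount = shared.count()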

Hope this helps.

Haoyuan

On Tue, Jun 30, 2015 at 11:28 PM, Himanshu Mehra <
himanshumehra@gmail.com> wrote:

> Hi neprasad,
>
> You should give the Tachyon system a try, or any other in-memory store. Here
> you have the link
> <
> http://www.slideshare.net/haoyuanli/tachyon20141121ampcamp5-41881671?related=1
> >
> to a slideshow explaining all this. What you want is not just the 10th
> slide; the whole deck is worth reading.
>
> Thank you.
>
> Himanshu


Fast big data analytics with Spark on Tachyon in Baidu

2015-05-12 Thread Haoyuan Li
Dear all,

We're organizing a meetup on
May 28th at IBM in Foster City that might be of interest to the Spark
community. The focus is a production use case of Spark and Tachyon at Baidu.

You can sign up here: http://www.meetup.com/Tachyon/events/222485713/

Hope some of you can make it!

Best,

Haoyuan


Re: tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Haoyuan Li
Daniel,

Instead of using localhost:19998, you may want to use the real IP address
the TachyonMaster is configured with. You should be able to see more info in
Tachyon's UI as well. More info can be found here:
http://tachyon-project.org/master/Running-Tachyon-on-EC2.html
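
As a sketch, with a hypothetical hostname (use whatever address the
TachyonMaster actually binds to on your cluster):

  // "localhost" resolves to each executor's own machine, so point at the
  // real master instead (hypothetical hostname below).
  val tachyonMaster = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
  rdd.saveAsTextFile(s"tachyon://$tachyonMaster:19998/path")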

Best,

Haoyuan

On Fri, Apr 24, 2015 at 12:21 PM, Daniel Mahler  wrote:

> I have a cluster launched with spark-ec2.
> I can see a TachyonMaster process running,
> but I do not seem to be able to use tachyon from the spark-shell.
>
> if I try
>
> rdd.saveAsTextFile("tachyon://localhost:19998/path")
> I get
>
> 15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0
> (TID 216, ip-10-63-69-48.ec2.internal, PROCESS_LOCAL, 1383 bytes)
> 15/04/24 19:18:31 WARN TaskSetManager: Lost task 32.2 in stage 1.0 (TID
> 177, ip-10-63-69-48.ec2.internal): java.io.IOException: Failed to connect
> to master localhost/127.0.0.1:19998 after 5 attempts
> at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
> at tachyon.client.TachyonFS.getUnderfsAddress(TachyonFS.java:1224)
> at tachyon.hadoop.TFS.initialize(TFS.java:289)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2262)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2296)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:83)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: tachyon.org.apache.thrift.TException: Failed to connect to
> master localhost/127.0.0.1:19998 after 5 attempts
> at tachyon.master.MasterClient.connect(MasterClient.java:178)
> at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
> ... 17 more
> Caused by: tachyon.org.apache.thrift.transport.TTransportException:
> java.net.ConnectException: Connection refused
> at
> tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
> at
> tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
> at tachyon.master.MasterClient.connect(MasterClient.java:156)
> ... 18 more
> Caused by: java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)
>     at
> tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:180)
> ... 20 more
>
>
>  What do I need to do before I can use tachyon?
>
> thanks
> Daniel
>



-- 
Haoyuan Li
CEO, Tachyon Nexus <http://www.tachyonnexus.com/>
AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/


Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
Response inline.

On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun  wrote:

> (resending...)
>
> I was thinking of the same setup… But the more I think about this problem,
> the more interesting it becomes.
>
> If we allocate 50% of total memory to Tachyon statically, then the Mesos
> benefits of dynamically scheduling resources go away altogether.
>

People can still benefit from Mesos' dynamic scheduling of the remaining
memory as well as of compute resources.


>
> Can Tachyon be resource managed by Mesos (dynamically)? Any thought or
> comment?
>


This requires some integration work.

Best,

Haoyuan


>
> Sean
>>
>> >Hi Haoyuan,
>> >
>> >So on each mesos slave node I should allocate/section off some amount
>> >of memory for tachyon (let's say 50% of the total memory) and the rest
>> >for regular mesos tasks?
>> >
>> >This means, on each slave node I would have tachyon worker (+ hdfs
>> >configuration to talk to s3 or the hdfs datanode) and the mesos slave
>> >process. Is this correct?
>>
>
>
> --
> --Sean
>
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Ankur,

Response inline.

On Tue, Mar 31, 2015 at 4:49 PM, Ankur Chauhan 
wrote:

>
> Hi Haoyuan,
>
> So on each mesos slave node I should allocate/section off some amount
> of memory for tachyon (let's say 50% of the total memory) and the rest
> for regular mesos tasks?
>
>
This depends on your machine spec and workload. The high-level idea is to
give Tachyon a memory size equal to the total memory of the machine minus the
memory needs of the other processes.



> This means, on each slave node I would have tachyon worker (+ hdfs
> configuration to talk to s3 or the hdfs datanode) and the mesos slave
> process. Is this correct?
>


On each slave node, you would run a Tachyon worker. For underfs, you can
configure it to use S3 or HDFS or others.

Best,

Haoyuan


>
> On 31/03/2015 16:43, Haoyuan Li wrote:
> > Tachyon should be co-located with Spark in this case.
> >
> > Best,
> >
> > Haoyuan
> >
> > On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan
> > mailto:achau...@brightcove.com>> wrote:
> >
> > Hi,
> >
> > I am fairly new to the spark ecosystem and I have been trying to
> > set up a spark on mesos deployment. I can't seem to figure out the
> > "best practices" around HDFS and Tachyon. The documentation about
> > Spark's data-locality section seems to suggest that each of my mesos
> > slave nodes should also run an hdfs datanode. This seems fine, but I
> > can't seem to figure out how I would go about pushing tachyon into
> > the mix.
> >
> > How should I organize my cluster? Should tachyon be colocated on my
> > mesos worker nodes? Or should all the spark jobs reach out to a
> > separate hdfs/tachyon cluster?
> >
> > -- Ankur Chauhan
> >
> > -- Haoyuan Li AMPLab, EECS, UC Berkeley
> > http://www.cs.berkeley.edu/~haoyuan/
>
> -- Ankur Chauhan
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Tachyon should be co-located with Spark in this case.

Best,

Haoyuan

On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan 
wrote:

>
> Hi,
>
> I am fairly new to the spark ecosystem and I have been trying to set up
> a spark on mesos deployment. I can't seem to figure out the "best
> practices" around HDFS and Tachyon. The documentation about Spark's
> data-locality section seems to suggest that each of my mesos slave nodes
> should also run an hdfs datanode. This seems fine, but I can't seem to
> figure out how I would go about pushing tachyon into the mix.
>
> How should I organize my cluster?
> Should tachyon be colocated on my mesos worker nodes? Or should all
> the spark jobs reach out to a separate hdfs/tachyon cluster?
>
> -- Ankur Chauhan


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Haoyuan Li
Did you recompile it with Tachyon 0.6.0?

Also, Tachyon 0.6.1 has been released this morning:
http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases
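
For reference, a minimal sketch of wiring OFF_HEAP to a non-default master,
assuming the Spark build and the running Tachyon cluster are version-matched
(the hostname is hypothetical; spark.tachyonStore.url otherwise defaults to
tachyon://localhost:19998):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
  // ...build the SparkContext with this conf, then:
  rdd.persist(StorageLevel.OFF_HEAP)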

Best regards,

Haoyuan

On Wed, Mar 18, 2015 at 11:48 AM, Ranga  wrote:

> I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
> issue. Here are the logs:
> 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
> 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
> 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to
> create tachyon dir in
> /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/
>
> Thanks for any other pointers.
>
>
> - Ranga
>
>
>
> On Wed, Mar 18, 2015 at 9:53 AM, Ranga  wrote:
>
>> Thanks for the information. Will rebuild with 0.6.0 till the patch is
>> merged.
>>
>> On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu  wrote:
>>
>>> Ranga:
>>> Take a look at https://github.com/apache/spark/pull/4867
>>>
>>> Cheers
>>>
>>> On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com 
>>> wrote:
>>>
>>>> Hi, Ranga
>>>>
>>>> That's true. Typically a version mis-match issue. Note that spark 1.2.1
>>>> has tachyon built in with version 0.5.0, so I think you may need to
>>>> rebuild spark against your current tachyon release.
>>>> We have used tachyon for several of our spark projects in a production
>>>> environment.
>>>>
>>>> Thanks,
>>>> Sun.
>>>>
>>>> --
>>>> fightf...@163.com
>>>>
>>>>
>>>> *From:* Ranga 
>>>> *Date:* 2015-03-18 06:45
>>>> *To:* user@spark.apache.org
>>>> *Subject:* StorageLevel: OFF_HEAP
>>>> Hi
>>>>
>>>> I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
>>>> cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
>>>> However, when I try to persist the RDD, I get the following error:
>>>>
>>>> ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
>>>> TachyonFS.java[connect]:364)  - Invalid method name:
>>>> 'getUserUnderfsTempFolder'
>>>> ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
>>>> TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'
>>>>
>>>> Is this because of a version mis-match?
>>>>
>>>> On a different note, I was wondering if Tachyon has been used in a
>>>> production environment by anybody in this group?
>>>>
>>>> Appreciate your help with this.
>>>>
>>>>
>>>> - Ranga
>>>>
>>>>
>>>
>>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: RE: Building spark over specified tachyon

2015-03-15 Thread Haoyuan Li
Here is a patch: https://github.com/apache/spark/pull/4867

On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com 
wrote:

> Thanks, Jerry
> I got it that way. Just wanted to make sure whether there is some option to
> directly specify the tachyon version.
>
>
> --
> fightf...@163.com
>
>
> *From:* Shao, Saisai 
> *Date:* 2015-03-16 11:10
> *To:* fightf...@163.com
> *CC:* user 
> *Subject:* RE: Building spark over specified tachyon
>
> I think you could change the pom file under the Spark project to update the
> Tachyon-related dependency version and rebuild it again (in case the API is
> compatible, and the behavior is the same).
>
>
>
> I'm not sure whether there is any command you can use to compile against a
> specific Tachyon version.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* fightf...@163.com [mailto:fightf...@163.com]
> *Sent:* Monday, March 16, 2015 11:01 AM
> *To:* user
> *Subject:* Building spark over specified tachyon
>
>
>
> Hi, all
>
> Noting that the current spark releases are built with tachyon 0.5.0 built in,
>
> if we want to recompile spark with maven targeting a specific tachyon
> version (let's say the most recent 0.6.0 release),
>
> how should that be done? What should the maven compile command look like?
>
>
>
> Thanks,
>
> Sun.
>
>
>  --
>
> fightf...@163.com
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Haoyuan Li
Jerry,

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark/Tachyon about
this. Hope to get it ready to be released soon.

Best,

Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam  wrote:

> Hi spark developers,
>
> I was thinking it would be nice to extract the data lineage information
> from a data processing pipeline. I assume that spark/tachyon keeps this
> information somewhere. For instance, a data processing pipeline uses
> datasources A and B to produce C. C is then used by another process to
> produce D and E. Assuming A, B, C, D, E are stored on disk, it would be very
> useful if there were a way to query this data lineage information when using
> spark/tachyon. For example: give me the datasets that produce E. It should
> return a graph like (A and B)->C->E.
>
> Is this something already possible with spark/tachyon? If not, do you
> think it is possible? Would anyone mind sharing their experience in
> capturing data lineage in a data processing pipeline?
>
> Best Regards,
>
> Jerry
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: spark broadcast unavailable

2014-12-10 Thread Haoyuan Li
Which Hadoop version are you using? It seems the exception you got was caused
by an incompatible hadoop version.
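
On the second issue in the original mail (the "task not serializable" error
while re-broadcasting), a rough sketch of the usual pattern, reusing the
poster's Utils.load / Utils.getIpInfo names and a hypothetical staleness
check: keep sc out of the executor-side closure and refresh the broadcast on
the driver, e.g. per batch inside foreachRDD.

  // Driver side: load and broadcast once up front.
  var ipLib = Utils.load(sc, ip_lib_path)

  kafkaDStream.foreachRDD { rdd =>
    // foreachRDD bodies run on the driver, so re-broadcasting here is safe.
    val broadcastIsStale = false  // hypothetical staleness check
    if (broadcastIsStale) {
      ipLib.unpersist()
      ipLib = Utils.load(rdd.sparkContext, ip_lib_path)
    }
    val b = ipLib  // local val: the closure captures only the broadcast
                   // handle, never sc itself
    rdd.map(line => Utils.getIpInfo(line, b.value)).count()
  }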

Best,

Haoyuan

On Wed, Dec 10, 2014 at 12:30 AM, 十六夜涙  wrote:

> Hi All,
> I've read the official docs of tachyon, and it seems not to fit my usage. As
> I understand it, it just caches files in memory, but I have a file of over a
> million lines (about 70mb), and retrieving the data and mapping it to a
> *Map* variable costs several minutes, which I don't want to repeat in each
> map function. Tachyon also raises another problem: an exception while doing
> *./bin/tachyon format*
> The exception:
> Exception in thread "main" java.lang.RuntimeException:
> org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot
> communicate with client version 4
> It seems there's a compatibility problem with hadoop, but even with that
> solved, the efficiency issue described above would remain.
> Could somebody tell me how to persist the data in memory? For now I just
> broadcast it, and re-submit the spark application whenever the broadcast
> value becomes unavailable.
>
>
>
> -- Original message --
> *From:* "Akhil Das";;
> *Sent:* Tuesday, December 9, 2014, 3:42 PM
> *To:* "十六夜涙";
> *Cc:* "user";
> *Subject:* Re: spark broadcast unavailable
>
> You cannot pass the sc object (*val b = Utils.load(sc,ip_lib_path)*)
> inside a map function and that's why the Serialization exception is popping
> up (since sc is not serializable). You can try tachyon's cache if you want
> to persist the data in memory kind of forever.
>
> Thanks
> Best Regards
>
> On Tue, Dec 9, 2014 at 12:12 PM, 十六夜涙  wrote:
>
>> Hi all
>> In my spark application, I load a csv file and map the data to a Map
>> variable for later use on the driver node, then broadcast it. Everything
>> works fine until a java.io.FileNotFoundException occurs: the console log
>> shows me the broadcast is unavailable. I googled this problem, which says
>> spark will clean up the broadcast. There is a solution in which the author
>> mentions re-broadcasting; I followed this way and wrote some exception
>> handling code with `try`/`catch`. After compiling and submitting the jar, I
>> faced another problem: it shows "task not serializable".
>> So here I have three options:
>> 1. find the right way to persist the broadcast
>> 2. solve the "task not serializable" problem and re-broadcast the variable
>> 3. save the data to some kind of database, although I prefer to keep the
>> data in memory.
>>
>> here is come code snippets:
>>   val esRdd = kafkaDStreams.flatMap(_.split("\\n"))
>>   .map{
>>   case esregex(datetime, time_request) =>
>> var ipInfo:Array[String]=Array.empty
>> try{
>> ipInfo = Utils.getIpInfo(client_ip,b.value)
>> }catch{
>>   case e:java.io.FileNotFoundException =>{
>> val b = Utils.load(sc,ip_lib_path)
>> ipInfo = Utils.getIpInfo(client_ip,b.value)
>>   }
>> }
>>
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Persist kafka streams to text file, tachyon error?

2014-11-22 Thread Haoyuan Li
StorageLevel.OFF_HEAP requires to run Tachyon:
http://spark.apache.org/docs/latest/programming-guide.html

If you don't know if you have tachyon or not, you probably don't :)
http://tachyon-project.org/

For local testing, you can use other persist() solutions without running
Tachyon.
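
For example, a minimal sketch (stream stands in for the poster's DStream):

  import org.apache.spark.storage.StorageLevel

  // Any on-JVM storage level works locally without a Tachyon master:
  stream.persist(StorageLevel.MEMORY_AND_DISK_SER)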

Best,

Haoyuan

On Fri, Nov 21, 2014 at 2:48 PM, Joanne Contact 
wrote:

> use the right email list.
> -- Forwarded message --
> From: Joanne Contact 
> Date: Fri, Nov 21, 2014 at 2:32 PM
> Subject: Persist kafka streams to text file
> To: u...@spark.incubator.apache.org
>
>
> Hello, I am trying to read a kafka stream to a text file by running spark
> from my IDE (IntelliJ IDEA). The code is similar to a previous thread on
> persisting a stream to a text file.
>
> I am new to spark and scala. I believe spark is in local mode, as the
> console shows
> 14/11/21 14:17:11 INFO spark.SparkContext: Spark configuration:
> spark.app.name=local-mode
>
>  I got the following errors. They are related to Tachyon, but I don't know if
> I have tachyon or not.
>
> 14/11/21 14:17:54 WARN storage.TachyonBlockManager: Attempt 1 to create
> tachyon dir null failed
> java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998
> after 5 attempts
> at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
> at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
> at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
> at
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
> at
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
> org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
> at
> org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
> at
> org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:88)
> at
> org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:82)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:729)
> at
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:594)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: tachyon.org.apache.thrift.TException: Failed to connect to
> master localhost/127.0.0.1:19998 after 5 attempts
> at tachyon.master.MasterClient.connect(MasterClient.java:178)
> at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
> ... 28 more
> Caused by: tachyon.org.apache.thrift.transport.TTransportException:
> java.net.ConnectException: Connection refused
> at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
> at
> tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
> at tachyon.master.MasterClient.connect(MasterClient.java:156)
> ... 29 more
> Caused by: java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)
> at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:180)
> ... 31 more
> 14/11/21 14:17:54 ERROR storage.TachyonBlockManager: Fail

Re: Saving very large data sets as Parquet on S3

2014-10-24 Thread Haoyuan Li
Daniel,

Currently, having Tachyon will at least help on the input part in this case.
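
Roughly, and assuming Tachyon is deployed with S3 configured as its
under-filesystem (hostname hypothetical): the first pass pulls the logs out
of S3 and caches them in Tachyon memory, so repeated scans over the many
small input files hit memory instead of S3.

  val data = sqlContext.jsonFile("tachyon://tachyon-master:19998/source/path/*/*")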

Haoyuan

On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler  wrote:

> I am trying to convert some json logs to Parquet and save them on S3.
> In principle this is just
>
> import org.apache.spark._
> val sqlContext = new sql.SQLContext(sc)
> val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
> data.registerAsTable("data")
> data.saveAsParquetFile("s3n://target/path")
>
> This works fine for up to about 10^9 records, but above that I start
> having problems.
> The first problem I encountered is that after the data files get written out,
> writing the Parquet summary file fails.
> While I seem to have all the data saved out,
> programs have a huge start-up time
> when processing parquet files without a summary file.
>
> Writing the summary file appears to depend primarily
> on the number of partitions being written,
> and to be relatively independent of the amount of data being written.
> Problems start after about 1000 partitions;
> writing 1 partitions fails even with one day's worth of
> data repartitioned.
>
> My data is very finely partitioned, about 16 log files per hour, or 13K
> files per month.
> The file sizes are very uneven, ranging over several orders of magnitude.
> There are several years of data.
> By my calculations this will produce 10s of terabytes of Parquet files.
>
> The first thing I tried to get around this problem
> was passing the data through `coalesce(1000, shuffle=false)` before
> writing.
> This works up to about a month worth of data,
> after that coalescing to 1000 partitions produces parquet files larger
> than 5G
> and writing to S3 fails as a result.
> Also coalescing slows processing down by at least a factor of 2.
> I do not understand why this should happen since I use shuffle=false.
> AFAIK coalesce should just be a bookkeeping trick and the original
> partitions should be processed pretty much the same as before, just with
> their outputs concatenated.
>
> The only other option I can think of is to write each month coalesced
> as a separate data set with its own summary file
> and union the RDDs when processing the data,
> but I do not know how much overhead that will introduce.
>
> I am looking for advice on the best way to save this size data in Parquet
> on S3.
> Apart from solving the the summary file issue i am also looking for ways
> to improve performance.
> Would it make sense to write the data to a local hdfs first and push it to
> S3 with `hadoop distcp`?
> Is putting Tachyon in front of either the input or the output S3 likely to
> help?
> If yes, which is likely to help more?
>
> I set options on the master as follows
>
> +
> cat <<EOF >>~/spark/conf/spark-defaults.conf
> spark.serializer  org.apache.spark.serializer.KryoSerializer
> spark.rdd.compress  true
> spark.shuffle.consolidateFiles  true
> spark.akka.frameSize  20
> EOF
>
> copy-dir /root/spark/conf
> spark/sbin/stop-all.sh
> sleep 5
> spark/sbin/start-all.sh
> ++
>
> Does this make sense? Should I set some other options?
> I have also asked these questions on StackOverflow where I reproduced the
> full error messages:
>
> +
> http://stackoverflow.com/questions/26332542/how-to-save-a-multi-terabyte-schemardd-in-parquet-format-on-s3
> +
> http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark
> +
> http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards
>
> thanks
> Daniel
>
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Fwd: Second Bay Area Tachyon meetup: October 21st, hosted by Pivotal (Limited Space)

2014-10-02 Thread Haoyuan Li
-- Forwarded message --
From: Haoyuan Li 
Date: Thu, Oct 2, 2014 at 10:12 AM
Subject: Second Bay Area Tachyon meetup: October 21st, hosted by Pivotal
(Limited Space)
To: tachyon-us...@googlegroups.com


Hi folks,

We've posted the second Tachyon meetup featuring exciting updates and
lessons learned from Tachyon deployments in production, which will be on
October 21st and is hosted by Pivotal (Limited Space):
http://www.meetup.com/Tachyon/events/210665202 . Hope to see you there!

Haoyuan




-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Haoyuan Li
Hi folks,

We've posted the first Tachyon meetup, which will be on August 25th and is
hosted by Yahoo! (Limited Space):
http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there!

Best,

Haoyuan

-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread Haoyuan Li
Is the speculative execution enabled?
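
Background for the question, with a sketch of the experiment (the property is
standard Spark configuration): speculative duplicate attempts of the same
task can race on the Tachyon checkpoint rename and produce exactly this kind
of FailedToCheckpointException, so disabling speculation is a quick way to
test.

  import org.apache.spark.SparkConf

  // Turn off duplicate (speculative) task attempts, then retry the save.
  val conf = new SparkConf().set("spark.speculation", "false")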

Best,

Haoyuan


On Mon, Aug 11, 2014 at 8:08 AM, chutium  wrote:

> Sharing/reusing RDDs is useful for many use cases; is this possible
> by persisting an RDD on tachyon?
>
> For example, off-heap persisting a named RDD into a given path (instead of
> /tmp_spark_tachyon/spark-xxx-xxx-xxx),
> or
> saveAsParquetFile on tachyon.
>
> I tried to save a SchemaRDD on tachyon,
>
> val parquetFile =
>
> sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
> parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")
>
> but it always fails; the first error message is:
>
> 14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
> in
> memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
> 14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0
> (TID 35, test04.zala): java.io.IOException:
> FailedToCheckpointException(message:Failed to rename
> hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to
> hdfs://test01.zala:8020/tmp/tachyon/data/730)
> tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
> tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
> tachyon.client.FileOutStream.close(FileOutStream.java:104)
>
>
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
> parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
>
>
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
>
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org
> $apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)
>
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
>
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
>
>
> hdfs://test01.zala:8020/tmp/tachyon/
> is already chmod'd to 777, and both owner and group are the same as the
> spark/tachyon startup user.
>
> Off-heap persist, or saving as a normal text file on tachyon, works fine.
>
> CDH 5.1.0, spark 1.1.0 snapshot, tachyon 0.6 snapshot


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/