[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375766#comment-14375766 ]

Theodore Vasiloudis commented on SPARK-2394:
--------------------------------------------

Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your 
cluster:

https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

There should be an easy way to execute this script while the cluster is being 
launched; I tried the --user-data flag, but that doesn't seem to do it. 
Otherwise you have to rsync the script to every machine (easy: use 
~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh 
into each machine and run it (not so easy).

For Step 4, make sure that core-site.xml is changed in both the hadoop config 
directory and the spark-conf/ directory. Also, as suggested in the 
hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out 
the line:

{code}
    JAVA_LIBRARY_PATH=''
{code}

{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/"
export SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar"
{code}

And this is what I added to both core-site.xml files:

{code:xml}
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
{code}


As for the code (Step 5) itself, I tried the different variations suggested on 
the mailing list and elsewhere and ended up using the following:

https://gist.github.com/thvasilo/cd99709eacb44c8a8cff

Note that this uses the sequenceFile reader, specifically for the Google 
Ngrams. Setting minPartitions is important in order to get good 
parallelization for whatever you do with the data later on (3 * the total 
cores in your cluster seems like a good value).
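
In case that gist disappears, here is a minimal sketch of what such a job looks 
like. The S3 path, the core count, and the package/object names are placeholders 
I've filled in for illustration; as far as I know the Google Ngrams files on S3 
are LZO block-compressed SequenceFiles with LongWritable keys and Text values, 
which is why the sequenceFile reader works here:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object TestNgrams {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("TestNgrams"))

    // Placeholders: point these at your own data set and cluster size.
    val path = "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data"
    val totalCores = 16

    // The ngrams are LZO-compressed SequenceFiles of (LongWritable, Text);
    // asking for ~3 * totalCores partitions keeps later stages well parallelized.
    val ngrams = sc.sequenceFile(path, classOf[LongWritable], classOf[Text], 3 * totalCores)

    // Hadoop reuses Writable objects, so copy the value out as a String.
    val lines = ngrams.map { case (_, text) => text.toString }

    println("Count: " + lines.count())
    println("First line: " + lines.first())

    sc.stop()
  }
}
{code}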

You can run the above job using:

{code}
./bin/spark-submit \
  --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  --class your.package.here.TestNgrams \
  --master $SPARK_MASTER \
  $SPARK_JAR dummy_arg
{code}

You should of course set the env variables for your Spark master and the 
location of your fat jar.
Note that I'm passing the hadoop-lzo jar as local:, which assumes that every 
node has built the jar; that is done by the script given above.

Do the above and you should get the count and the first line of the data when 
running the job.

> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
>                 Key: SPARK-2394
>                 URL: https://issues.apache.org/jira/browse/SPARK-2394
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
>
> Amazon hosts [a large Google n-grams data set on 
> S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
> perfect, among other things, for putting together interesting and easily 
> reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently 
> more painful than it should be to get your average {{spark-ec2}} cluster to 
> read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with 
> {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
> it. To build {{hadoop-lzo}} you need Maven. 
> # Install Maven. For some reason, [you cannot install Maven with 
> {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
>  so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
> configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate 
> calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
>  to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from 
> a {{spark-ec2}} cluster -- it would be great if we could somehow make this 
> less painful.


