[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375766#comment-14375766 ]

Theodore Vasiloudis edited comment on SPARK-2394 at 3/24/15 3:09 PM:
---------------------------------------------------------------------

Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your cluster at launch, which is easily done by downloading the script and passing it via the --user-data flag when launching your cluster:

https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

For example, if you saved the script to /home/user/lzo.sh, you would launch your cluster with:

./spark-ec2 -k your-key -i your-key.pem --instance-type=m3.large \
  --user-data=/home/user/lzo.sh -s 2 launch ClusterName

For Step 4, make sure that core-site.xml is changed in both the Hadoop config and the spark-conf/ directory. Also, as suggested in the hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out 
the line:

{code}
    JAVA_LIBRARY_PATH=''
{code}

{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/"
export SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar"
{code}

And here is what I added to both core-site.xml files:

{code:xml}
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
{code}

Here is an easy way to test if everything works (replace ephemeral-hdfs with 
persistent-hdfs if you are using that):

{code}
echo "hello world" > test.log
lzop test.log
ephemeral-hdfs/bin/hadoop fs -copyFromLocal test.log.lzo /user/root/test.log.lzo
# Test the local indexer
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  com.hadoop.compression.lzo.LzoIndexer /user/root/test.log.lzo
# Test the distributed indexer
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/root/test.log.lzo
{code}
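
If the indexers run fine, you can also sanity-check the Spark side from spark-shell. Below is a minimal sketch (mine, not from the gists above) that reads the test file through hadoop-lzo's LzoTextInputFormat via sc.newAPIHadoopFile, assuming spark-shell was started with the hadoop-lzo jar on its classpath:

{code}
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Keys are byte offsets into the file, values are the text lines.
val records = sc.newAPIHadoopFile(
  "/user/root/test.log.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Copy the reused Writables into plain Strings before collecting.
records.map(_._2.toString).collect().foreach(println)  // prints "hello world"
{code}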


As for the code (Step 5) itself, I've tried the different variations suggested on the mailing list and elsewhere, and ended up using the following:

https://gist.github.com/thvasilo/cd99709eacb44c8a8cff

Note that this uses the sequenceFile reader, specifically for the Google Ngrams data. Setting minPartitions is important in order to get good parallelism in whatever you do with the data later on (3 * the number of cores in your cluster seems like a good value).
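
For reference, here is a minimal sketch along the lines of that gist. The S3 path and core count are illustrative only; the AWS-hosted ngrams are LZO block-compressed SequenceFiles whose key is the row number (LongWritable) and whose value is the tab-separated row (Text):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object TestNgrams {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestNgrams"))

    // Illustrative path to one of the ngrams datasets on S3.
    val path = "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data"

    // 3 * total cores in the cluster, e.g. 2 m3.large workers with 2 cores each.
    val minPartitions = 3 * 4

    val rows = sc
      .sequenceFile(path, classOf[LongWritable], classOf[Text], minPartitions)
      .map { case (_, row) => row.toString } // copy out of the reused Writable

    println(s"Count: ${rows.count()}")
    println(s"First line: ${rows.first()}")
    sc.stop()
  }
}
{code}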

You can run the above job using:

{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}

You should of course set the environment variables for your Spark master and the location of your fat jar.
Note that I'm passing the hadoop-lzo jar with the local: prefix, which assumes that every node has built the jar; that is taken care of by the script given above.

Do the above and you should get the count and the first line of the data when 
running the job.



> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
>                 Key: SPARK-2394
>                 URL: https://issues.apache.org/jira/browse/SPARK-2394
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
>
> Amazon hosts [a large Google n-grams data set on 
> S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
> perfect, among other things, for putting together interesting and easily 
> reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently 
> more painful than it should be to get your average {{spark-ec2}} cluster to 
> read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with 
> {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
> it. To build {{hadoop-lzo}} you need Maven. 
> # Install Maven. For some reason, [you cannot install Maven with 
> {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
>  so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
> configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate 
> calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
>  to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from 
> a {{spark-ec2}} cluster -- it would be great if we could somehow make this 
> less painful.


