[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

Nicholas Chammas (JIRA) Mon, 23 Mar 2015 07:13:14 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375929#comment-14375929 ]


Nicholas Chammas commented on SPARK-2394:
-----------------------------------------

Thank you for posting this information for others!

> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
>                 Key: SPARK-2394
>                 URL: https://issues.apache.org/jira/browse/SPARK-2394
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
>
> Amazon hosts [a large Google n-grams data set on 
> S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
> perfect, among other things, for putting together interesting and easily 
> reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently 
> more painful than it should be to get your average {{spark-ec2}} cluster to 
> read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with 
> {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
> it. To build {{hadoop-lzo}} you need Maven. 
> # Install Maven. For some reason, [you cannot install Maven with 
> {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
>  so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
> configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate 
> calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
>  to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from 
> a {{spark-ec2}} cluster -- it would be great if we could somehow make this 
> less painful.
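
For readers following the steps quoted above, the final step (the call to sc.newAPIHadoopFile) might look roughly like the PySpark sketch below. It is an illustration under assumptions, not the reporter's exact code: the input format class name is taken from the hadoop-lzo project, the helper name read_lzo_lines is hypothetical, and the hadoop-lzo jar plus the native LZO libraries are assumed to already be installed and on the classpath of the driver and every executor.

```python
# Class names below are assumptions based on the hadoop-lzo project and the
# standard Hadoop Writable types; verify them against the jar you built.
LZO_TEXT_INPUT_FORMAT = "com.hadoop.mapreduce.LzoTextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"   # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"         # the line itself


def read_lzo_lines(sc, path):
    """Return an RDD of text lines read from LZO-compressed input.

    `sc` is an existing SparkContext; `path` is any Hadoop-readable URI.
    This helper is hypothetical and only wraps sc.newAPIHadoopFile.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        LZO_TEXT_INPUT_FORMAT,
        KEY_CLASS,
        VALUE_CLASS,
    )
    # Each record is an (offset, line) pair; keep only the line.
    return pairs.map(lambda kv: kv[1])
```

With a running SparkContext, usage would be along the lines of read_lzo_lines(sc, "<URI of the LZO-compressed input>").take(5), where the URI is a placeholder for wherever the data lives.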



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
