[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375929#comment-14375929 ]
Nicholas Chammas commented on SPARK-2394:
-----------------------------------------
Thank you for posting this information for others!
> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
> Key: SPARK-2394
> URL: https://issues.apache.org/jira/browse/SPARK-2394
> Project: Spark
> Issue Type: Improvement
> Components: EC2, Input/Output
> Affects Versions: 1.0.0
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: compression
>
> Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670].
> This data set is perfect, among other things, for putting together interesting
> and easily reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently
> more painful than it should be to get your average {{spark-ec2}} cluster to
> read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with
> {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven.
> # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from
> a {{spark-ec2}} cluster -- it would be great if we could somehow make this
> less painful.
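For anyone landing here later, the last two steps above might look roughly like the sketch below. It assumes a PySpark version that exposes sc.newAPIHadoopFile, that the hadoop-lzo jar and native libraries are already on every node, and that the input is LZO-compressed plain text (other layouts need a different input format class). The helper name read_lzo_text and the LZO_HADOOP_CONF dict are hypothetical, not taken from the linked threads. The class names are the ones I believe twitter/hadoop-lzo and Hadoop ship, so please verify them against your build.

```python
# Hypothetical helper; a sketch under the assumptions stated above.
# Codec settings that would otherwise live in core-site.xml,
# passed per call through the `conf` argument instead.
LZO_HADOOP_CONF = {
    "io.compression.codecs": ",".join([
        "org.apache.hadoop.io.compress.DefaultCodec",
        "org.apache.hadoop.io.compress.GzipCodec",
        "com.hadoop.compression.lzo.LzoCodec",
        "com.hadoop.compression.lzo.LzopCodec",
    ]),
    "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec",
}


def read_lzo_text(sc, path):
    """Return an RDD of text lines read from LZO-compressed files at `path`.

    `sc` is an existing SparkContext; nothing is imported here, so the
    function only relies on the object passed in.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        "com.hadoop.mapreduce.LzoTextInputFormat",  # from hadoop-lzo
        "org.apache.hadoop.io.LongWritable",        # key: byte offset
        "org.apache.hadoop.io.Text",                # value: line contents
        conf=LZO_HADOOP_CONF,
    )
    # Drop the offsets and keep only the line text.
    return pairs.map(lambda kv: kv[1])
```

With a running context, read_lzo_text(sc, path) would then be called with whatever S3 or HDFS path holds the compressed files.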