Nicholas Chammas created SPARK-2394:
---------------------------------------
Summary: Make it easier to read LZO-compressed files from EC2 clusters
Key: SPARK-2394
URL: https://issues.apache.org/jira/browse/SPARK-2394
Project: Spark
Issue Type: Improvement
Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor
Amazon hosts [a large Google n-grams data set on
S3|https://aws.amazon.com/datasets/8172056142375670]. Among other things, this
data set is well suited to putting together interesting and easily reproducible
public demos of Spark's capabilities.
The problem is that the data set is compressed using LZO, and it is currently
more painful than it should be to get your average {{spark-ec2}} cluster to
read input compressed in this way.
This is what one has to go through to get a Spark cluster created with
{{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Install Maven, which is needed to build {{hadoop-lzo}}. For some reason, [you
cannot install Maven with
{{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
so install it manually.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate
configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate
calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
to {{sc.newAPIHadoopFile}}.
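As a rough illustration of the last two steps, here is a minimal PySpark sketch. It makes several assumptions: the {{hadoop-lzo}} jar and native libraries are already on every node's classpath and library path, the PySpark version in use exposes {{newAPIHadoopFile}} (it may not in 1.0.0), and the input is plain LZO-compressed text. The codec property names and the {{LzoTextInputFormat}} class come from the {{hadoop-lzo}} project and should be checked against the version you build. The helper name {{read_lzo_text}} is hypothetical, and the codec settings are passed per call here instead of through {{core-site.xml}}.

```python
# Hadoop properties that register the hadoop-lzo codecs.
# Assumption: names as documented by the hadoop-lzo project; verify locally.
LZO_HADOOP_CONF = {
    "io.compression.codecs": ",".join([
        "org.apache.hadoop.io.compress.DefaultCodec",
        "org.apache.hadoop.io.compress.GzipCodec",
        "com.hadoop.compression.lzo.LzoCodec",
        "com.hadoop.compression.lzo.LzopCodec",
    ]),
    "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec",
}


def read_lzo_text(sc, path):
    """Return an RDD of lines from an LZO-compressed text file.

    `sc` is an existing SparkContext; `path` is any Hadoop-readable URI.
    """
    records = sc.newAPIHadoopFile(
        path,
        inputFormatClass="com.hadoop.mapreduce.LzoTextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=LZO_HADOOP_CONF,
    )
    # Each record is (byte offset, line); keep only the line text.
    return records.map(lambda kv: kv[1])
```

From a PySpark shell this would be called as {{read_lzo_text(sc, "s3n://some-bucket/some-file.lzo")}}, where the path is a placeholder. Files stored in another layout, such as SequenceFiles, would need a different input format class.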
This seems like a bit too much work for what we're trying to accomplish.
If we expect this to be a common pattern -- reading LZO-compressed files from a
{{spark-ec2}} cluster -- it would be great if we could somehow make this less
painful.