[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375929#comment-14375929 ]
Nicholas Chammas commented on SPARK-2394:
-----------------------------------------

Thank you for posting this information for others!

> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>
>                 Key: SPARK-2394
>                 URL: https://issues.apache.org/jira/browse/SPARK-2394
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
>
> Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way.
> This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven.
> # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually.
> # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
> # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful.
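
For readers who want to see roughly what the last step in the list looks like, here is a minimal PySpark sketch of a call to sc.newAPIHadoopFile. It is not taken from the linked mailing-list post. It assumes the cluster already has hadoop-lzo and its native libraries installed and configured (the earlier steps), that the input is plain LZO-compressed text, and that hadoop-lzo's com.hadoop.mapreduce.LzoTextInputFormat is the right input format for it; data stored in another layout would likely need a different input format class. The helper name read_lzo_text and the path in the usage note are hypothetical.

```python
# Class names passed to Spark as strings. LzoTextInputFormat comes from the
# hadoop-lzo project; using it here is an assumption about the input layout.
LZO_TEXT_INPUT_FORMAT = "com.hadoop.mapreduce.LzoTextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"   # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"         # the line itself


def read_lzo_text(sc, path, input_format=LZO_TEXT_INPUT_FORMAT):
    """Hypothetical helper: return an RDD of lines from LZO-compressed text.

    `sc` is an existing SparkContext. `input_format` can be overridden if the
    data is not plain LZO-compressed text.
    """
    # newAPIHadoopFile yields (key, value) pairs; keep only the line text.
    pairs = sc.newAPIHadoopFile(path, input_format, KEY_CLASS, VALUE_CLASS)
    return pairs.map(lambda kv: kv[1])
```

From the pyspark shell on a configured cluster, this would be used along the lines of read_lzo_text(sc, "s3n://some-bucket/some-file.lzo").take(5), where the bucket and file name are placeholders.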