Re: reading compress lzo files

Andrew Ash Sun, 06 Jul 2014 14:17:35 -0700

Hi Nick,

The cluster I was working on in those linked messages was a private data
center cluster, not on EC2.  I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.


Also, I recently upgraded that cluster to Spark 1.0 and am continuing to use
LZO-compressed data, so I know there's not a version issue.

Andrew


On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I’ve been reading through several pages trying to figure out how to set up
> my spark-ec2 cluster to read LZO-compressed files from S3.
>
>    - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E
>    - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E
>    - https://github.com/twitter/hadoop-lzo
>    - http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> It seems that several things may have changed since the above pages were
> put together, so getting this to work is more work than I expected.
>
> Is there a simple set of instructions somewhere one can follow to get a
> Spark EC2 cluster reading LZO-compressed input files correctly?
>
> Nick
> 
>
>
> On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Ah, indeed it looks like I need to install this separately
>> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
>> as it is not part of the core.
>>
>> Nick
>>
>>
>>
>> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh <
>> gurvinder.si...@uninett.no> wrote:
>>
>>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>>> > <gurvinder.si...@uninett.no> wrote:
>>> >
>>> >     csv = sc.newAPIHadoopFile(opts.input,
>>> >         "com.hadoop.mapreduce.LzoTextInputFormat",
>>> >         "org.apache.hadoop.io.LongWritable",
>>> >         "org.apache.hadoop.io.Text").count()
>>> >
>>> > Does anyone know what the rough equivalent of this would be in the
>>> > Scala API?
>>> >
>>> I am not sure; I haven't tested it using Scala. The
>>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>>> https://github.com/twitter/hadoop-lzo
>>>
>>> I installed it from Cloudera's "hadoop-lzo" package, together with the
>>> liblzo2-2 Debian package, on all of my workers. Make sure you have
>>> hadoop-lzo.jar in your classpath for Spark.
>>>
>>> - Gurvinder
>>>
>>> > I am trying the following, but the first import yields an error on my
>>> > spark-ec2 cluster:
>>> >
>>> > import com.hadoop.mapreduce.LzoTextInputFormat
>>> > import org.apache.hadoop.io.LongWritable
>>> > import org.apache.hadoop.io.Text
>>> >
>>> > sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>>> >     LzoTextInputFormat, LongWritable, Text)
>>> >
>>> > scala> import com.hadoop.mapreduce.LzoTextInputFormat
>>> > <console>:12: error: object hadoop is not a member of package com
>>> >        import com.hadoop.mapreduce.LzoTextInputFormat
>>> >
>>> > Nick
>>> >
>>> > 
>>>
>>>
>>>
>>
>
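
On the Scala question quoted above: a rough equivalent of the PySpark call
might look like the sketch below. It assumes it is run inside spark-shell
(where sc is predefined) and that hadoop-lzo.jar and the native LZO library
are already available to the driver and every worker; without the jar on the
classpath, the first import fails with the "object hadoop is not a member of
package com" error shown in the quoted message. The path is the one from that
message, and the sketch is untested on a spark-ec2 cluster. Note that the
Scala API takes Class objects, so each type is wrapped in classOf[...] rather
than passed as a bare name.

```scala
// Sketch only: assumes spark-shell with hadoop-lzo.jar on the classpath.
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Path taken from the quoted message.
val path = "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data"

// Argument order: path, input format class, key class, value class.
val records = sc.newAPIHadoopFile(
  path,
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Hadoop reuses Writable instances, so convert each value to a String
// before doing anything beyond a simple count.
val lines = records.map { case (_, text) => text.toString }

lines.count()
```

The key in each record is the byte offset of the line, which is usually not
needed, so the sketch drops it and keeps only the line text.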
