Re: Combine previous Map Results

2008-04-25 Thread Dina Said

Thanks Joydeep.
I am sorry for not recognizing that in the first place.

Joydeep Sen Sarma wrote:
> Ummm .. it was in the initial reply:
>
>> you can write a mapper that can decide the map logic based on the input
>> file name (look for the jobconf variable map.input.file in Java - or the
>> environment variable map_input_file in hadoop streaming).
>
> -Original Message-
> From: Dina Said [mailto:[EMAIL PROTECTED]
> Sent: Friday, April 25, 2008 5:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Combine previous Map Results
>
> Thanks Ted
>
> But how can I specify that the inputs coming from certain files
> should be processed by f_a and the other inputs should be processed by
> f_b?
> Or how can I check the input type?
>
> The input to the map is in the form of InputSplits, as far as I know.
>
> Dina
>
> Ted Dunning wrote:
>> You can only have one map function.
>>
>> But that function can decide which sort of thing to do based on which
>> input it is given.  That allows input of type A to be processed with map
>> function f_a and input of type B to be processed with map function f_b.
>>
>> On 4/25/08 4:43 PM, "Dina Said" <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks Joydeep for your reply.
>>>
>>> But is there a possibility to have two or more Map tasks and a single
>>> reduce task?
>>> I want the reduce task to work on all the intermediate keys produced
>>> from these Map tasks.
>>>
>>> I am sorry, I am new to Map-Reduce, but from my first reading
>>> I can see that we can define only one Map task.
>>>
>>> Thanks
>>> Dina
>>>
>>> Joydeep Sen Sarma wrote:
>>>> if one weren't thinking about performance - then the second map-reduce
>>>> task would have to process both the data sets (the intermediate data
>>>> and the new data). For the existing intermediate data - you want to do
>>>> an identity map and for the new data - whatever map logic you have.
>>>> you can write a mapper that can decide the map logic based on the input
>>>> file name (look for the jobconf variable map.input.file in Java - or
>>>> the environment variable map_input_file in hadoop streaming).
>>>>
>>>> if one were thinking about performance - then one would argue that
>>>> re-sorting the existing intermediate data (as would happen in the
>>>> simple solution) is pointless (it's already sorted by the desired key).
>>>> if this is a concern - the only thing that's available right now
>>>> (afaik) is a feature described in hadoop-2085. (you would have to
>>>> map-reduce the new data set only and then join the old and new data
>>>> using map-side joins described in this jira - this would require a
>>>> third map-reduce task).
>>>>
>>>> (one could argue that if there was an option to skip map-side sorting
>>>> on a per-file level - that would be perfect. one would skip map-side
>>>> sorts of the old data and only sort the new data - and the reducer
>>>> would merge the two).
>>>>
>>>> -Original Message-
>>>> From: Dina Said [mailto:[EMAIL PROTECTED]
>>>> Sent: Sat 4/19/2008 1:55 PM
>>>> To: core-user@hadoop.apache.org
>>>> Subject: Combine previous Map Results
>>>>
>>>> Dear all
>>>>
>>>> Suppose that I have files that have intermediate key values and I want
>>>> to combine these intermediate key values with a new MapReduce task. I
>>>> want this MapReduce task to combine during the reduce stage the
>>>> intermediate key values it generates with the intermediate key values I
>>>> already have.
>>>>
>>>> Any ideas?
>>>>
>>>> Dina



UnknownScannerException on Short Job

2008-04-25 Thread jkupferman

Hi Everyone,

I have been having a lot of issues reading from an HBase table; I keep
getting UnknownScannerExceptions when I try to iterate over the table.

I read on this forum that this is a result of scanner leases timing out when
mappers take too long before calling next(). However, this issue does not seem
to be caused by the mapper itself, since even a map function that contains only
a return statement throws the same exception. When I run the same program on
the same table with a local job tracker, it seems to work. It does, however,
work distributed on a small table.

Any suggestions on why it works locally but not distributed?


The following is the exception that is being thrown:
org.apache.hadoop.hbase.UnknownScannerException:
org.apache.hadoop.hbase.UnknownScannerException: -2478484392535468022
at org.apache.hadoop.hbase.HRegionServer.close(HRegionServer.java:1482)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
at org.apache.hadoop.hbase.HTable$ClientScanner.close(HTable.java:1169)
at
org.apache.hadoop.hbase.mapred.TableInputFormat$TableRecordReader.close(TableInputFormat.java:88)
at
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:155)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:212)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)




-- 
View this message in context: 
http://www.nabble.com/UnknownScannerException-on-Short-Job-tp16909679p16909679.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



RE: Combine previous Map Results

2008-04-25 Thread Joydeep Sen Sarma
Ummm .. it was in the initial reply:

> you can write a mapper that can decide the map logic based on the input
> file name (look for the jobconf variable map.input.file in Java - or the
> environment variable map_input_file in hadoop streaming).
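
For illustration, a minimal sketch of such a mapper, written against the old
org.apache.hadoop.mapred API; the "old-intermediate" path fragment and the
tab-separated record layout are assumptions made for the example, not anything
specified in this thread:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class SwitchingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private boolean oldData;

    public void configure(JobConf job) {
      // map.input.file holds the path of the file this map task is reading.
      String inputFile = job.get("map.input.file", "");
      // "old-intermediate" is just a placeholder directory name.
      oldData = inputFile.contains("old-intermediate");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      int tab = line.indexOf('\t');
      String k = tab >= 0 ? line.substring(0, tab) : line;
      String v = tab >= 0 ? line.substring(tab + 1) : "";
      if (oldData) {
        // Existing intermediate data: identity map, just re-emit key/value.
        output.collect(new Text(k), new Text(v));
      } else {
        // New data: apply whatever real map logic is needed (placeholder here).
        output.collect(new Text(k.toLowerCase()), new Text(v));
      }
    }
  }

Wiring it in with JobConf.setMapperClass(SwitchingMapper.class), the single
reduce then sees keys from both the old and the new data, which is the
combining behaviour asked about in this thread.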

-Original Message-
From: Dina Said [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 25, 2008 5:42 PM
To: core-user@hadoop.apache.org
Subject: Re: Combine previous Map Results

Thanks Ted

But how can I specify that the inputs coming from certain files
should be processed by f_a and the other inputs should be processed by
f_b?
Or how can I check the input type?

The input to the map is in the form of InputSplits, as far as I know.

Dina

Ted Dunning wrote:
> You can only have one map function.
>
> But that function can decide which sort of thing to do based on which
> input it is given.  That allows input of type A to be processed with map
> function f_a and input of type B to be processed with map function f_b.
>
>
>
>
> On 4/25/08 4:43 PM, "Dina Said" <[EMAIL PROTECTED]> wrote:
>
>   
>> Thanks Joydeep for your reply.
>>
>> But is there a possibility to have two or more Map tasks and a single
>> reduce task?
>> I want the reduce task to work on all the intermediate keys produced
>> from these Map tasks.
>>
>> I am sorry, I am new to Map-Reduce, but from my first reading
>> I can see that we can define only one Map task.
>>
>> Thanks
>> Dina
>>
>>
>> Joydeep Sen Sarma wrote:
>> 
>>> if one weren't thinking about performance - then the second map-reduce
>>> task would have to process both the data sets (the intermediate data and
>>> the new data). For the existing intermediate data - you want to do an
>>> identity map and for the new data - whatever map logic you have. you can
>>> write a mapper that can decide the map logic based on the input file name
>>> (look for the jobconf variable map.input.file in Java - or the environment
>>> variable map_input_file in hadoop streaming).
>>>
>>> if one were thinking about performance - then one would argue that
>>> re-sorting the existing intermediate data (as would happen in the simple
>>> solution) is pointless (it's already sorted by the desired key). if this
>>> is a concern - the only thing that's available right now (afaik) is a
>>> feature described in hadoop-2085. (you would have to map-reduce the new
>>> data set only and then join the old and new data using map-side joins
>>> described in this jira - this would require a third map-reduce task).
>>>
>>>
>>> (one could argue that if there was an option to skip map-side sorting on a
>>> per-file level - that would be perfect. one would skip map-side sorts of
>>> the old data and only sort the new data - and the reducer would merge the
>>> two).
>>>
>>>
>>> -Original Message-
>>> From: Dina Said [mailto:[EMAIL PROTECTED]
>>> Sent: Sat 4/19/2008 1:55 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Combine previous Map Results
>>>  
>>> Dear all
>>>
>>> Suppose that I have files that have intermediate key values and I want
>>> to combine these intermediate key values with a new MapReduce task. I
>>> want this MapReduce task to combine during the reduce stage the
>>> intermediate key values it generates with the intermediate key values I
>>> already have.
>>>
>>> Any ideas?
>>>
>>> Dina
>>>
>>>
>>>   
>>>   
>
>
>   



RE: Best practices for handling many small files

2008-04-25 Thread Joydeep Sen Sarma
There seem to be two problems with small files:
1. namenode overhead (3307 seems like _a_ solution)
2. map-reduce processing overhead and locality

It's not clear from the 3307 description how the archives interface with
map-reduce. How are the splits done? Will they solve problem #2?

To some extent, the goals of an archive and the (ideal)
multifileinputformat (MFIF) differ. The archive wants to preserve the
identity of each of the sub-objects. MFIF offers a way for users to
homogenize small objects into larger ones (while processing). When users
want to use MFIF - they don't care about the identities of each of the
small files.

In the interest of layering - it seems we should keep these separate.
3307 can offer an efficient way to store small files. A good MFIF
implementation can offer a way to efficiently do map-reduce on small
files (whether those small files are regular hdfs files or are backed by
an archive). This way the user will also have the option to use regular
fileinputformat (say they _do_ care about the file being processed).


-Original Message-
From: Konstantin Shvachko [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 25, 2008 10:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

Would the new archive feature HADOOP-3307 that is currently being
developed help this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:
> 
> We have actually written a custom Multi File Splitter that collapses all
> the small files to a single split till the DFS Block Size is hit.
> We also take care of handling big files by splitting them on Block Size
> and adding up all the remainders (if any) to a single split.
> 
> It works great for us:-)
> We are working on optimizing it further to club all the small files in a
> single data node together so that the Map can have maximum local data.
> 
> We plan to share this (provided it's found acceptable, of course) once
> this is done.
> 
> Regards,
> Subru
> 
> Stuart Sierra wrote:
> 
>> Thanks for the advice, everyone.  I'm going to go with #2, packing my
>> million files into a small number of SequenceFiles.  This is slow, but
>> only has to be done once.  My "datacenter" is Amazon Web Services :),
>> so storing a few large, compressed files is the easiest way to go.
>>
>> My code, if anyone's interested, is here:
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>
>> -Stuart
>> altlaw.org
>>
>>
>> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra 
>> <[EMAIL PROTECTED]> wrote:
>>  
>>
>>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>>  which each file is a single "record"?
>>>
>>>  1. Ignore it, let Hadoop create a million Map processes;
>>>  2. Pack all the files into a single SequenceFile; or
>>>  3. Something else?
>>>
>>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>>  work?
>>>
>>>  Thanks,
>>>  -Stuart, altlaw.org
>>>
>>> 
> 
> 


Re: Combine previous Map Results

2008-04-25 Thread Dina Said
Thanks Ted

But how can I specify that the inputs coming from certain files
should be processed by f_a and the other inputs should be processed by f_b?
Or how can I check the input type?

The input to the map is in the form of InputSplits, as far as I know.

Dina

Ted Dunning wrote:
> You can only have one map function.
>
> But that function can decide which sort of thing to do based on which input
> it is given.  That allows input of type A to be processed with map function
> f_a and input of type B to be processed with map function f_b.
>
>
>
>
> On 4/25/08 4:43 PM, "Dina Said" <[EMAIL PROTECTED]> wrote:
>
>   
>> Thanks Joydeep for your reply.
>>
>> But is there a possibility to have two or more Map tasks and a single
>> reduce task?
>> I want the reduce task to work on all the intermediate keys produced
>> from these Map tasks.
>>
>> I am sorry, I am new to Map-Reduce, but from my first reading
>> I can see that we can define only one Map task.
>>
>> Thanks
>> Dina
>>
>>
>> Joydeep Sen Sarma wrote:
>> 
>>> if one weren't thinking about performance - then the second map-reduce task
>>> would have to process both the data sets (the intermediate data and the new
>>> data). For the existing intermediate data - you want to do an identity map
>>> and for the new data - whatever map logic you have. you can write a mapper
>>> that
>>> can decide the map logic based on the input file name (look for the jobconf
>>> variable map.input.file in Java - or the environment variable map_input_file
>>> in hadoop streaming).
>>>
>>> if one were thinking about performance - then one would argue that 
>>> re-sorting
>>> the existing intermediate data (as would happen in the simple solution) is
>>> pointless (it's already sorted by the desired key). if this is a concern -
>>> the only thing that's available right now (afaik) is a feature described in
>>> hadoop-2085. (you would have to map-reduce the new data set only and then
>>> join the old and new data using map-side joins described in this jira - this
>>> would require a third map-reduce task).
>>>
>>>
>>> (one could argue that if there was an option to skip map-side sorting on a
>>> per-file level - that would be perfect. one would skip map-side sorts of the
>>> old data and only sort the new data - and the reducer would merge the two).
>>>
>>>
>>> -Original Message-
>>> From: Dina Said [mailto:[EMAIL PROTECTED]
>>> Sent: Sat 4/19/2008 1:55 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Combine previous Map Results
>>>  
>>> Dear all
>>>
>>> Suppose that I have files that have intermediate key values and I want
>>> to combine these intermediate key values with a new MapReduce task. I
>>> want this MapReduce task to combine during the reduce stage the
>>> intermediate key values it generates with the intermediate key values I
>>> already have.
>>>
>>> Any ideas?
>>>
>>> Dina
>>>
>>>
>>>   
>>>   
>
>
>   



Re: Combine previous Map Results

2008-04-25 Thread Ted Dunning


You can only have one map function.

But that function can decide which sort of thing to do based on which input
it is given.  That allows input of type A to be processed with map function
f_a and input of type B to be processed with map function f_b.




On 4/25/08 4:43 PM, "Dina Said" <[EMAIL PROTECTED]> wrote:

> Thanks Joydeep for your reply.
> 
> But is there a possibility to have two or more Map tasks and a single
> reduce task?
> I want the reduce task to work on all the intermediate keys produced
> from these Map tasks.
> 
> I am sorry, I am new to Map-Reduce, but from my first reading
> I can see that we can define only one Map task.
> 
> Thanks
> Dina
> 
> 
> Joydeep Sen Sarma wrote:
>> if one weren't thinking about performance - then the second map-reduce task
>> would have to process both the data sets (the intermediate data and the new
>> data). For the existing intermediate data - you want to do an identity map
>> and for the new data - whatever map logic you have. you can write a mapper that
>> can decide the map logic based on the input file name (look for the jobconf
>> variable map.input.file in Java - or the environment variable map_input_file
>> in hadoop streaming).
>> 
>> if one were thinking about performance - then one would argue that re-sorting
>> the existing intermediate data (as would happen in the simple solution) is
>> pointless (it's already sorted by the desired key). if this is a concern -
>> the only thing that's available right now (afaik) is a feature described in
>> hadoop-2085. (you would have to map-reduce the new data set only and then
>> join the old and new data using map-side joins described in this jira - this
>> would require a third map-reduce task).
>> 
>> 
>> (one could argue that if there was an option to skip map-side sorting on a
>> per-file level - that would be perfect. one would skip map-side sorts of the
>> old data and only sort the new data - and the reducer would merge the two).
>> 
>> 
>> -Original Message-
>> From: Dina Said [mailto:[EMAIL PROTECTED]
>> Sent: Sat 4/19/2008 1:55 PM
>> To: core-user@hadoop.apache.org
>> Subject: Combine previous Map Results
>>  
>> Dear all
>> 
>> Suppose that I have files that have intermediate key values and I want
>> to combine these intermediate key values with a new MapReduce task. I
>> want this MapReduce task to combine during the reduce stage the
>> intermediate key values it generates with the intermediate key values I
>> already have.
>> 
>> Any ideas?
>> 
>> Dina
>> 
>> 
>>   
> 



Re: Combine previous Map Results

2008-04-25 Thread Dina Said
Thanks Joydeep for your reply.

But is there a possibility to have two or more Map tasks and a single
reduce task?
I want the reduce task to work on all the intermediate keys produced
from these Map tasks.

I am sorry, I am new to Map-Reduce, but from my first reading
I can see that we can define only one Map task.

Thanks
Dina


Joydeep Sen Sarma wrote:
> if one weren't thinking about performance - then the second map-reduce task 
> would have to process both the data sets (the intermediate data and the new 
> data). For the existing intermediate data - you want to do an identity map 
> and for the new data - whatever map logic you have. you can write a mapper that 
> can decide the map logic based on the input file name (look for the jobconf 
> variable map.input.file in Java - or the environment variable map_input_file 
> in hadoop streaming).
>
> if one were thinking about performance - then one would argue that re-sorting 
> the existing intermediate data (as would happen in the simple solution) is 
> pointless (it's already sorted by the desired key). if this is a concern - 
> the only thing that's available right now (afaik) is a feature described in 
> hadoop-2085. (you would have to map-reduce the new data set only and then 
> join the old and new data using map-side joins described in this jira - this 
> would require a third map-reduce task).
>
>
> (one could argue that if there was an option to skip map-side sorting on a 
> per-file level - that would be perfect. one would skip map-side sorts of the 
> old data and only sort the new data - and the reducer would merge the two).
>
>
> -Original Message-
> From: Dina Said [mailto:[EMAIL PROTECTED]
> Sent: Sat 4/19/2008 1:55 PM
> To: core-user@hadoop.apache.org
> Subject: Combine previous Map Results
>  
> Dear all
>
> Suppose that I have files that have intermediate key values and I want
> to combine these intermediate key values with a new MapReduce task. I
> want this MapReduce task to combine during the reduce stage the
> intermediate key values it generates with the intermediate key values I
> already have.
>
> Any ideas?
>
> Dina
>
>
>   



Re: Hadoop User Group (UK)

2008-04-25 Thread Lukas Vlcek
Hi,

Is there any plan to record any video and make it accessible to the rest of
the world?

Regards,
Lukas

On Fri, Apr 25, 2008 at 5:29 PM, Johan Oskarsson <[EMAIL PROTECTED]> wrote:

> August 19th brings the first of many Hadoop User Group meetups in the UK.
> It will be hosted somewhere in London and we'll have presentations from
> both developers and users of Apache Hadoop.
>
> The event is free and anyone is welcome.
> Please help us by adding yourself as attending if you're coming:
> http://upcoming.yahoo.com/event/506444
>
> If you're interested in presenting please let us know at [EMAIL PROTECTED]
>
> Preliminary speakers:
> Doug Cutting (Yahoo!) - Hadoop overview
> Tom White (Lexemetech) - Hadoop on Amazon S3/EC2
> Steve Loughran and Julio Guijarro (HP) - Smartfrog and Hadoop
> Martin Dittus and Johan Oskarsson (Last.fm) - Hadoop usage at Last.fm
>
>
> More details, presenters and venue announced at a later date. Keep an eye
> on the upcoming event page.
>



-- 
http://blog.lukas-vlcek.com/


Re: Best practices for handling many small files

2008-04-25 Thread Konstantin Shvachko

Would the new archive feature HADOOP-3307 that is currently being developed 
help this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:


We have actually written a custom Multi File Splitter that collapses all 
the small files to a single split till the DFS Block Size is hit.
We also take care of handling big files by splitting them on Block Size 
and adding up all the remainders (if any) to a single split.


It works great for us:-)
We are working on optimizing it further to club all the small files in a 
single data node together so that the Map can have maximum local data.


We plan to share this (provided it's found acceptable, of course) once 
this is done.


Regards,
Subru

Stuart Sierra wrote:


Thanks for the advice, everyone.  I'm going to go with #2, packing my
million files into a small number of SequenceFiles.  This is slow, but
only has to be done once.  My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files

-Stuart
altlaw.org


On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra 
<[EMAIL PROTECTED]> wrote:
 


Hello all, Hadoop newbie here, asking: what's the preferred way to
 handle large (~1 million) collections of small files (10 to 100KB) in
 which each file is a single "record"?

 1. Ignore it, let Hadoop create a million Map processes;
 2. Pack all the files into a single SequenceFile; or
 3. Something else?

 I started writing code to do #2, transforming a big tar.bz2 into a
 BLOCK-compressed SequenceFile, with the file names as keys.  Will that
 work?

 Thanks,
 -Stuart, altlaw.org







Hadoop User Group (UK)

2008-04-25 Thread Johan Oskarsson

August 19th brings the first of many Hadoop User Group meetups in the UK.
It will be hosted somewhere in London and we'll have presentations from
both developers and users of Apache Hadoop.

The event is free and anyone is welcome.
Please help us by adding yourself as attending if you're coming: 
http://upcoming.yahoo.com/event/506444


If you're interested in presenting please let us know at [EMAIL PROTECTED]

Preliminary speakers:
Doug Cutting (Yahoo!) - Hadoop overview
Tom White (Lexemetech) - Hadoop on Amazon S3/EC2
Steve Loughran and Julio Guijarro (HP) - Smartfrog and Hadoop
Martin Dittus and Johan Oskarsson (Last.fm) - Hadoop usage at Last.fm


More details, presenters and venue announced at a later date. Keep an 
eye on the upcoming event page.


Re: Best practices for handling many small files

2008-04-25 Thread Subramaniam Krishnan


We have actually written a custom Multi File Splitter that collapses all 
the small files to a single split till the DFS Block Size is hit.
We also take care of handling big files by splitting them on Block Size 
and adding up all the remainders (if any) to a single split.


It works great for us:-)
We are working on optimizing it further to club all the small files in a 
single data node together so that the Map can have maximum local data.
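
To make the grouping policy concrete, here is a rough, self-contained sketch of
that bucketing idea (not the actual splitter referred to above): pack files into
one bucket until the DFS block size is reached, and give oversized files their
own bucket. It uses plain java.io.File for brevity, skips the real Hadoop
InputSplit/RecordReader plumbing, and assumes a 64 MB block size.

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;

  public class SmallFileGrouper {
    // Assumed DFS block size; a real splitter would read it from the job conf.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Groups files into buckets of roughly one block each. */
    static List<List<File>> group(File[] files) {
      List<List<File>> buckets = new ArrayList<List<File>>();
      List<File> current = new ArrayList<File>();
      long used = 0;
      for (File f : files) {
        long len = f.length();
        if (len >= BLOCK_SIZE) {
          // Simplification: a big file gets its own bucket here; the splitter
          // described above would chop it at block boundaries instead.
          List<File> own = new ArrayList<File>();
          own.add(f);
          buckets.add(own);
          continue;
        }
        if (used + len > BLOCK_SIZE && !current.isEmpty()) {
          buckets.add(current);            // close the bucket at the block boundary
          current = new ArrayList<File>();
          used = 0;
        }
        current.add(f);                    // small file joins the current bucket
        used += len;
      }
      if (!current.isEmpty()) {
        buckets.add(current);
      }
      return buckets;
    }

    public static void main(String[] args) {
      File[] files = new File(args[0]).listFiles();
      int i = 0;
      for (List<File> bucket : group(files)) {
        System.out.println("split " + (i++) + ": " + bucket.size() + " file(s)");
      }
    }
  }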


We plan to share this (provided it's found acceptable, of course) once 
this is done.


Regards,
Subru

Stuart Sierra wrote:

Thanks for the advice, everyone.  I'm going to go with #2, packing my
million files into a small number of SequenceFiles.  This is slow, but
only has to be done once.  My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files

-Stuart
altlaw.org


On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
  

Hello all, Hadoop newbie here, asking: what's the preferred way to
 handle large (~1 million) collections of small files (10 to 100KB) in
 which each file is a single "record"?

 1. Ignore it, let Hadoop create a million Map processes;
 2. Pack all the files into a single SequenceFile; or
 3. Something else?

 I started writing code to do #2, transforming a big tar.bz2 into a
 BLOCK-compressed SequenceFile, with the file names as keys.  Will that
 work?

 Thanks,
 -Stuart, altlaw.org
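
As a footnote to option #2 above, a minimal sketch of packing a directory of
small local files into a BLOCK-compressed SequenceFile with the file names as
keys, roughly what Stuart describes (his actual code is at the URL he posted).
The command-line arguments, the local input directory, and reading whole files
into memory are assumptions made for the example.

  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path(args[1]);          // e.g. packed.seq on the default FS

      // BLOCK compression groups many records per compressed block,
      // which is what makes lots of tiny values cheap to store.
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, Text.class, BytesWritable.class,
          SequenceFile.CompressionType.BLOCK);
      try {
        for (File f : new File(args[0]).listFiles()) {   // local input directory
          byte[] bytes = new byte[(int) f.length()];
          FileInputStream in = new FileInputStream(f);
          try {
            int off = 0;
            int n;
            while (off < bytes.length
                && (n = in.read(bytes, off, bytes.length - off)) != -1) {
              off += n;
            }
          } finally {
            in.close();
          }
          // Key = file name, value = raw file contents.
          writer.append(new Text(f.getName()), new BytesWritable(bytes));
        }
      } finally {
        writer.close();
      }
    }
  }

A map-reduce job can then read the packed file back with
SequenceFileInputFormat, getting one record per original small file.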