Re: Map/Reduce with XML files ..

2008-04-28 Thread Kayla Jay
Yes, I'm talking about a collection of small XML files stored in "container"
files, i.e. there are lots and lots of small XML files collected into big files,
not one gargantuan XML file. How would you go about using Hadoop, with its
splits, to process and handle these sorts of XML files?


----- Original Message -----
From: Ted Dunning <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Monday, April 28, 2008 4:16:20 PM
Subject: Re: Map/Reduce with XML files ..


The only real problem with xml and map-reduce is if you are talking about
one gargantuan XML file.  That makes correct splitting difficult.

If you are talking about millions or billions of small xml files (stored in
some sort of container file), then hadoop should be pretty easy to use.


On 4/28/08 9:39 AM, "Kayla Jay" <[EMAIL PROTECTED]> wrote:

> Hello
> 
> Has anyone had any experience with processing xml files within Hadoop within
> their maps/reduces?
> In particular, has anyone used any sort of XQuery/XPath processing within
> their maps/reduces?
> Say I have XML string passed to the map and now I want to find something in
> particular via XQuery/XPath or some sort to run numbers on occurrences or
> parse out a particular section within the XML.
> 
> Anyone done any XML processing looking for things within XML?  Then, aggregate
> common pieces together in the reduces ?
> 
> 
> On another note,
> Has anyone figured out splits for XML files?
> Has anyone written a custom XML reader other than the StreamXmlRecordReader?
> The only one I've read about and can find anything is:
> http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html
> 
> 
> Thanks.


  


Re: Map/Reduce with XML files ..

2008-04-28 Thread Ted Dunning

The only real problem with xml and map-reduce is if you are talking about
one gargantuan XML file.  That makes correct splitting difficult.

If you are talking about millions or billions of small xml files (stored in
some sort of container file), then hadoop should be pretty easy to use.
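
As a rough illustration of the container-file idea, here is a minimal Python sketch in which each small XML document is stored as a length-prefixed record, so a reader always gets whole documents and never has to split inside one. The record layout and the names `write_container` and `read_container` are assumptions made for this example, not anything from Hadoop; in a real job a format such as Hadoop's SequenceFile would normally fill this role.

```python
import io
import struct


def write_container(stream, docs):
    """Write each XML document as a 4-byte big-endian length plus UTF-8 bytes."""
    for doc in docs:
        data = doc.encode("utf-8")
        stream.write(struct.pack(">I", len(data)))
        stream.write(data)


def read_container(stream):
    """Yield documents one at a time; each is a complete, independent record."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of container
        if len(header) < 4:
            raise ValueError("truncated record header")
        (length,) = struct.unpack(">I", header)
        data = stream.read(length)
        if len(data) < length:
            raise ValueError("truncated record body")
        yield data.decode("utf-8")


# Hypothetical sample documents, packed into an in-memory container.
docs = ["<doc id='1'><t>a</t></doc>", "<doc id='2'><t>b</t></doc>"]
buf = io.BytesIO()
write_container(buf, docs)
buf.seek(0)
records = list(read_container(buf))
```

Because every record is a whole document, each map call can receive one small XML string and parse it independently of its neighbours.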


On 4/28/08 9:39 AM, "Kayla Jay" <[EMAIL PROTECTED]> wrote:

> Hello
> 
> Has anyone had any experience with processing xml files within Hadoop within
> their maps/reduces?
> In particular, has anyone used any sort of XQuery/XPath processing within
> their maps/reduces?
> Say I have XML string passed to the map and now I want to find something in
> particular via XQuery/XPath or some sort to run numbers on occurrences or
> parse out a particular section within the XML.
> 
> Anyone done any XML processing looking for things within XML?  Then, aggregate
> common pieces together in the reduces ?
> 
> 
> On another note,
> Has anyone figured out splits for XML files?
> Has anyone written a custom XML reader other than the StreamXmlRecordReader?
> The only one I've read about and can find anything is:
> http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html
> 
> 
> Thanks.



Map/Reduce with XML files ..

2008-04-28 Thread Kayla Jay
Hello

Has anyone had any experience processing XML files in Hadoop within their
maps/reduces?
In particular, has anyone used any sort of XQuery/XPath processing within their
maps/reduces?
Say I have an XML string passed to the map, and I want to find something in
particular via XQuery/XPath (or something similar), either to count occurrences
or to parse out a particular section of the XML.

Has anyone done any XML processing that looks for things within the XML and then
aggregates the common pieces together in the reduces?


On another note,
Has anyone figured out splits for XML files?
Has anyone written a custom XML reader other than the StreamXmlRecordReader?
The only discussion I've been able to find is:
http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html


Thanks.
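
For readers wondering what the map and reduce steps described above might look like, here is a minimal sketch in plain Python, run locally rather than on Hadoop. It assumes each map call receives one complete small XML document as a string. The element name `item`, the attribute `type`, and the function names are hypothetical. Note that Python's `xml.etree.ElementTree` supports only a limited subset of XPath; full XPath or XQuery would need a different library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_xml(xml_string, path=".//item"):
    """Map step: parse one XML document and emit (key, 1) for each match."""
    root = ET.fromstring(xml_string)
    for node in root.findall(path):  # limited XPath subset
        yield node.get("type", "unknown"), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input: two small XML documents, one per map call.
documents = [
    "<order><item type='book'/><item type='pen'/></order>",
    "<order><item type='book'/></order>",
]
pairs = [pair for doc in documents for pair in map_xml(doc)]
counts = reduce_counts(pairs)  # {'book': 2, 'pen': 1}
```

The same two functions could be wrapped as a streaming mapper and reducer; the point here is only the shape of the computation.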



  


Re: Best practices for handling many small files

2008-04-28 Thread Doug Cutting

Joydeep Sen Sarma wrote:

> There seem to be two problems with small files:
> 1. namenode overhead. (3307 seems like _a_ solution)
> 2. map-reduce processing overhead and locality
>
> It's not clear from the 3307 description how the archives interface with
> map-reduce. How are the splits done? Will they solve problem #2?


Yes, I think 3307 will address (2).  Many small files will be packed 
into fewer larger files, each file typically substantially larger than a 
block.  A splitter can read the index files and then use 
MultiFileInputFormat, so that each split could contain files that are 
contained almost entirely in a single block.


Good MapReduce performance is a requirement for the design of 3307.

Doug
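
To make the splitting idea above concrete, here is a small Python sketch that groups packed files into splits by block. It is not the 3307 design or the MultiFileInputFormat implementation; the index layout of `(name, offset, length)` tuples and the function `plan_splits` are assumptions for illustration. Each file is assigned to the block containing its midpoint, which for a file smaller than a block is a block holding at least half of its bytes.

```python
from collections import defaultdict


def plan_splits(index, block_size):
    """Group (name, offset, length) entries by the block containing their midpoint."""
    splits = defaultdict(list)
    for name, offset, length in index:
        midpoint = offset + length // 2
        splits[midpoint // block_size].append(name)
    # One split per block that holds files, in block order.
    return [splits[block] for block in sorted(splits)]


# Hypothetical index of four small files packed back to back.
index = [
    ("a.xml", 0, 40),
    ("b.xml", 40, 50),
    ("c.xml", 90, 30),   # straddles the boundary at 100; midpoint is 105
    ("d.xml", 120, 60),
]
splits = plan_splits(index, block_size=100)
# [['a.xml', 'b.xml'], ['c.xml', 'd.xml']]
```

Scheduling each split on a node that holds the corresponding block would then keep most reads local.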


Re: Distributed indexing

2008-04-28 Thread Ted Dunning

Check out the bailey and katta projects on sourceforge.

Also take a look at Nutch.

Hadoop is certainly good for indexing and it isn't that hard to put
distributed search alongside hadoop with indexes being pulled from HDFS to
local storage or RAM for speed.
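
A toy sketch of that division of labour, in plain Python: a batch step builds one inverted index per shard (the part map/reduce is good at), and a separate serving step queries every shard and merges the hits. This is not how Nutch, katta, or bailey are implemented; the whitespace tokenizer, the sharding by document id, and the function names are assumptions, and a real system would use Lucene indexes rather than dictionaries.

```python
from collections import defaultdict


def build_shards(documents, num_shards):
    """Batch step: assign documents to shards and build one inverted index per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in documents.items():
        shard = shards[doc_id % num_shards]
        for term in text.lower().split():
            shard[term].add(doc_id)
    return shards


def search(shards, term):
    """Serving step: ask every shard for the term and merge the results."""
    hits = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)


# Hypothetical documents keyed by integer id.
documents = {
    1: "ACGT reads from lane one",
    2: "reads from lane two",
    3: "assembly of contigs",
}
shards = build_shards(documents, num_shards=2)
reads_hits = search(shards, "reads")      # [1, 2]
contig_hits = search(shards, "contigs")   # [3]
```

Retrieval here never touches the batch step, which mirrors the suggestion of pulling finished indexes out of HDFS and serving them from local storage.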


On 4/28/08 7:50 AM, "Matt Wood" <[EMAIL PROTECTED]> wrote:

> Hello all,
> 
> I was wondering if someone in the know could tell me about the current
> state of play with building and searching large indices with hadoop?
> 
> Some background: I work on the human genome project, and we're
> currently setting up a new facility based around the next generation
> of DNA sequencing. We're currently producing around 50Tb of data a
> week, some of which we would like to provide fast access to via an
> index.
> 
> Having read up on hadoop, it appears that it could play a central part
> in our infrastructure, and that others have tried (and succeeded) in
> building a distributed indexing and retrieval system with hadoop. I'd
> be interested if anyone could point me in the right direction to more
> information or examples of such a system. Yahoo! (with webmap) seems
> to be close to the sort of thing we would need.
> 
> Would map/reduce be a suitable approach for indexing _and_ retrieval,
> or just indexing? Would Solr/Lucene be a good fit? Any help or
> pointers to more information would be  much appreciated!
> 
> If you would like any more details, I'd be more than happy to supply
> them!
> 
> Many thanks,
> 
> ~ Matt
> 
> 
> -
> 
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
> 
> 



Distributed indexing

2008-04-28 Thread Matt Wood

Hello all,

I was wondering if someone in the know could tell me about the current  
state of play with building and searching large indices with hadoop?


Some background: I work on the human genome project, and we're  
currently setting up a new facility based around the next generation  
of DNA sequencing. We're currently producing around 50Tb of data a  
week, some of which we would like to provide fast access to via an  
index.


Having read up on hadoop, it appears that it could play a central part  
in our infrastructure, and that others have tried (and succeeded) in  
building a distributed indexing and retrieval system with hadoop. I'd  
be interested if anyone could point me in the right direction to more  
information or examples of such a system. Yahoo! (with webmap) seems  
to be close to the sort of thing we would need.


Would map/reduce be a suitable approach for indexing _and_ retrieval,  
or just indexing? Would Solr/Lucene be a good fit? Any help or  
pointers to more information would be  much appreciated!


If you would like any more details, I'd be more than happy to supply  
them!


Many thanks,

~ Matt


-

Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk



--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 


subscribe me please

2008-04-28 Thread arash tnt
I would like to be subscribed.

I have some problems with Hadoop. How can I get proper answers, please?

Regards.

   