Hi Joydeep, Ted, Bob, Doug and all Hadoopers,
Thanks for the suggestions & ideas. Here is a little update on what we have done over the previous weeks: we have basically followed your suggestions and redesigned our directory structure in the following way. We now use a (dynamic) hash to partition the data into 50 chunks/directories, and thereby got rid of the millions of directories we had before.

Unfortunately, Hadoop does not yet support atomic appends :-( As a workaround, we use the copyMerge function to consolidate all the small chunks/files that arrive from our crawlers in a random and concurrent fashion. In our first version, each crawler was responsible for merging the preexisting data with its own chunk (as long as no other crawler was working on the same hash at the same time). The major drawback of this solution was excessive traffic between the DFS cluster and the crawlers, since copyMerge is implemented on the client side rather than on the server/cluster side :-( To circumvent this problem, we have set up a daemon close to the cluster which performs the consolidation task periodically (a rough sketch of that pass is in the P.S. below). By consolidating the small files into 64 MB chunks, we have reached a pretty good average MapReduce throughput of 10 GB per 20 minutes (~8 MB/s).

We were also considering other approaches, for example: each crawler opens its own stream and does not close it until the file exceeds 64 MB or so. Unfortunately, if a crawler crashes without closing its stream properly, all data written to the stream up to that point is lost. We have also noticed that the flush method does not really flush the data onto the DFS...

New ideas and suggestions are always welcome. Thanks in advance and have a great weekend!

Cu on the 'net,
                       Bye - bye,

                                  <<<<< André <<<< >>>> èrbnA >>>>>
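
P.S.: In case it is useful, the consolidation pass of the daemon boils down to roughly the following sketch. The paths, the bucket count and the sleep interval are placeholders rather than our real values, and the hash helper just illustrates how the crawlers pick their directory:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Periodically collapses the small crawler files in each hash bucket
// into one large chunk, using FileUtil.copyMerge.
public class ConsolidationDaemon {

  private static final int NUM_BUCKETS = 50;             // hash fan-out
  private static final Path INCOMING = new Path("/crawl/incoming");
  private static final Path MERGED   = new Path("/crawl/merged");

  // Same idea the crawlers use to pick the directory for a record key.
  static int bucketFor(String key) {
    return (key.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    while (true) {
      for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
        Path srcDir  = new Path(INCOMING, String.valueOf(bucket));
        Path dstFile = new Path(MERGED, bucket + "-" + System.currentTimeMillis());

        if (!fs.exists(srcDir) || fs.listStatus(srcDir).length == 0) {
          continue;  // nothing new from the crawlers in this bucket
        }

        // Concatenate all files under srcDir into dstFile and delete the
        // small originals afterwards.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null);
      }
      Thread.sleep(10 * 60 * 1000L);  // one consolidation pass every 10 minutes
    }
  }
}

Note that copyMerge still streams every byte through the machine it runs on, so the win comes purely from running the daemon on a node next to the cluster instead of on the crawlers.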

Joydeep Sen Sarma wrote:
Would it help if MultiFileInputFormat bundled files into splits based on
their location? (Wondering whether remote copy speed is a bottleneck in the map phase.)
If you are going to access the files many times after they are generated,
writing a job to bundle the data once up front may be worthwhile.
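
A minimal sketch of such a one-off bundling job, written against the old mapred API with identity map/reduce; the input/output paths and the reduce count are invented for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// One-off job that reads the many small crawler files and rewrites them
// as a fixed, small number of large files (one per reduce task).
public class BundleSmallFiles {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(BundleSmallFiles.class);
    conf.setJobName("bundle-small-files");

    conf.setMapperClass(IdentityMapper.class);      // pass records through
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(50);                     // ends up as 50 output files

    conf.setOutputKeyClass(LongWritable.class);     // byte offset from TextInputFormat
    conf.setOutputValueClass(Text.class);           // the record itself

    FileInputFormat.setInputPaths(conf, new Path("/crawl/incoming"));
    FileOutputFormat.setOutputPath(conf, new Path("/crawl/bundled"));

    JobClient.runJob(conf);
  }
}

With plain TextInputFormat/TextOutputFormat the byte-offset keys end up in the output, so in practice one would probably drop the key in a small custom mapper or bundle into SequenceFiles instead.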
Ted Dunning wrote:
I think that would help some, but the real problem for high performance is
disorganized behavior of the disk head.  If the MFIFormat could organize
files according to disk location as well and avoid successive file opens,
you might be OK, but that is asking for the moon.
Bob Futrelle wrote:
In some cases, it should be possible to concatenate many files into one.
If necessary, a unique byte sequence could be inserted at each "seam".

 - Bob Futrelle
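
As a toy illustration of the seam idea (plain local I/O, and the marker byte sequence is just an invented example):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Concatenates many files into one, writing a unique byte sequence at
// each "seam" so the original boundaries can be recovered later.
public class ConcatWithSeams {
  private static final byte[] SEAM = "\n#---SEAM-7f3a9c---#\n".getBytes();

  public static void concat(File[] inputs, File output) throws IOException {
    FileOutputStream out = new FileOutputStream(output);
    byte[] buf = new byte[64 * 1024];
    try {
      for (File in : inputs) {
        FileInputStream is = new FileInputStream(in);
        try {
          int n;
          while ((n = is.read(buf)) != -1) {
            out.write(buf, 0, n);
          }
        } finally {
          is.close();
        }
        out.write(SEAM);  // mark the boundary between two input files
      }
    } finally {
      out.close();
    }
  }
}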
Doug Cutting wrote:
Instead of organizing output into many directories you might consider using keys which encode that directory structure. Then mapreduce can use these to partition output. If you wish to mine only a subset of your data, you can process just those partitions which contain the portions of the keyspace you're interested in.

Doug
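
A rough sketch of what that could look like with a custom partitioner in the old mapred API, assuming keys shaped like "<former-directory>/<rest-of-key>" (the key layout is invented for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partitions on the part of the key that used to be the directory name,
// so all records of one former "directory" land in the same output partition.
public class KeyPrefixPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf conf) { }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    int slash = k.indexOf('/');
    String prefix = (slash < 0) ? k : k.substring(0, slash);
    return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}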
Ted Dunning wrote:
A fairly standard approach in a problem like this is for each of the
crawlers to write multiple items to separate files (one per crawler).  This
can be done using a map-reduce structure by having the map input be items to
crawl and then using the reduce mechanism to gather like items together.
This structure can give you the desired output structure, especially if the
reduces are careful about naming their outputs well.
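
One way to get the "naming their outputs well" part, assuming a Hadoop version that ships MultipleTextOutputFormat in org.apache.hadoop.mapred.lib, is to derive the reduce output file name from the key; the naming scheme below is just an example:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each reduce output record into a file named after its key
// (e.g. the hash bucket or crawler id) instead of the default part-NNNNN files.
public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    return key.toString();  // e.g. key "bucket-17" -> output file "bucket-17"
  }
}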
