Hello everyone,

This week the two different methods for reading complete items according to a specific tag were completed and tested in a standalone HDFS deployment. In detail, what each method does:

The first method, which I call the One Buffer Method, reads a block, saves it in a buffer, and continues reading from the following blocks until it finds the specified closing tag. It shows good results and good times in the tests.
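To make the idea concrete, here is a minimal sketch of the One Buffer Method outside Hadoop, treating the file as a string divided into fixed-size blocks. The class and method names are hypothetical, not the actual XMLInputFormat code: each block emits every item whose opening tag starts inside it, and when the closing tag falls in a later block it simply keeps reading past the block boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class OneBufferSketch {

    // Return every item whose opening tag begins inside block i.
    // If the closing tag lies beyond the block end, keep scanning
    // into the following blocks (the "one buffer" behaviour), so
    // items spanning block boundaries are still returned whole.
    static List<String> itemsForBlock(String data, int blockSize,
                                      int i, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = i * blockSize;
        int end = Math.min(start + blockSize, data.length());
        List<String> items = new ArrayList<>();
        int pos = data.indexOf(open, start);
        // Only items that *start* in this block belong to it; a later
        // block will not re-emit them because their opening tag is
        // before its start offset.
        while (pos >= 0 && pos < end) {
            int closePos = data.indexOf(close, pos);
            if (closePos < 0) {
                break; // unterminated item at end of file
            }
            items.add(data.substring(pos, closePos + close.length()));
            pos = data.indexOf(open, closePos + close.length());
        }
        return items;
    }
}
```

With data "<item>aa</item><item>bbbb</item>" and a block size of 10, the first item starts in block 0 and the second in block 1, and each block returns exactly its own complete item even though both cross block boundaries.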

The second method, called the Shared File Method, reads only the complete items contained in the block; the incomplete items at the start and end of the block are sent to a shared file in the HDFS Distributed Cache. This method works only for relatively small inputs, since the Distributed Cache is limited in size and, in the case of hundreds or thousands of blocks, the shared file can exceed that limit.
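A rough sketch of the per-block scan for this method, again in plain Java with hypothetical names (the real implementation would write the fragments to the Distributed Cache shared file rather than return them): each block yields its fully contained items plus a head fragment (the tail of an item begun in a previous block) and a tail fragment (the start of an item finished in a later block).

```java
import java.util.ArrayList;
import java.util.List;

public class SharedFileSketch {

    // What one block contributes: its complete items, plus the two
    // boundary fragments that would go to the shared file in the
    // Distributed Cache to be stitched together in a later pass.
    static class BlockScan {
        final List<String> completeItems = new ArrayList<>();
        String headFragment = ""; // end of an item begun in a previous block
        String tailFragment = ""; // start of an item ending in a later block
    }

    static BlockScan scanBlock(String block, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        BlockScan result = new BlockScan();
        int firstOpen = block.indexOf(open);
        if (firstOpen < 0) {
            // Whole block is the middle of one item: all head fragment.
            result.headFragment = block;
            return result;
        }
        result.headFragment = block.substring(0, firstOpen);
        int pos = firstOpen;
        while (pos >= 0) {
            int closePos = block.indexOf(close, pos);
            if (closePos < 0) {
                // Item runs past the block end: becomes the tail fragment.
                result.tailFragment = block.substring(pos);
                return result;
            }
            result.completeItems.add(
                    block.substring(pos, closePos + close.length()));
            pos = block.indexOf(open, closePos + close.length());
        }
        return result;
    }
}
```

For a block like "b</item><item>cc</item><item>d", this yields "b</item>" as the head fragment, one complete item, and "<item>d" as the tail fragment, which is exactly the data that would grow the shared file as the number of blocks grows.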

I took the liberty of creating diagrams that show by example what each method does:
[1] One Buffer Method
[2] Shared File Method

Any insight and feedback about these two methods is more than welcome. In my opinion the One Buffer Method is simpler and more effective, since it can be used for both small and large datasets.

There is also a question: can the parser work on data that is missing some tags? For example, the first and last tags of the XML file, which are located in different blocks.

Best regards,
Efi

[1] https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing

[2] https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing



On 05/19/2015 12:43 AM, Michael Carey wrote:
+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:
Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi <efika...@gmail.com> wrote:

Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat Java class with the functionalities that Hamza described in the issue [1]. The class reads from blocks located in HDFS and
returns complete items according to a specified XML tag.
I also tested this class in a standalone Hadoop cluster with XML files of various sizes, the smallest being a single file of 400 MB and the largest a
collection of 5 files totalling 6.1 GB.

This week I will create another implementation of the XMLInputFormat with a different way of reading and delivering files, the way I described in the
same issue, and I will test both solutions in a standalone and a small
Hadoop cluster (5-6 nodes).

You can see this week's results here [2]. I will keep updating this file
with the results of the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
