Hello everyone,

This week the two different methods for reading complete items according to a specific tag were completed and tested in a standalone HDFS deployment. In detail, what each method does:

The first method, which I call the One Buffer Method, reads a block, saves it in a buffer, and continues reading from the following blocks until it finds the specified closing tag. It shows good results and good times in the tests.
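To make the idea concrete, here is a minimal sketch of the One Buffer Method outside Hadoop, treating the file as a string divided into fixed-size blocks. The class and method names are hypothetical, not the actual XMLInputFormat code: each block emits every item whose opening tag starts inside it, and when the closing tag falls in a later block it simply keeps reading past the block boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class OneBufferSketch {

    // Return every item whose opening tag begins inside block i.
    // If the closing tag lies beyond the block end, keep scanning
    // into the following blocks (the "one buffer" behaviour), so
    // items spanning block boundaries are still returned whole.
    static List<String> itemsForBlock(String data, int blockSize,
                                      int i, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = i * blockSize;
        int end = Math.min(start + blockSize, data.length());
        List<String> items = new ArrayList<>();
        int pos = data.indexOf(open, start);
        // Only items that *start* in this block belong to it; a later
        // block will not re-emit them because their opening tag is
        // before its start offset.
        while (pos >= 0 && pos < end) {
            int closePos = data.indexOf(close, pos);
            if (closePos < 0) {
                break; // unterminated item at end of file
            }
            items.add(data.substring(pos, closePos + close.length()));
            pos = data.indexOf(open, closePos + close.length());
        }
        return items;
    }
}
```

With data "<item>aa</item><item>bbbb</item>" and a block size of 10, the first item starts in block 0 and the second in block 1, and each block returns exactly its own complete item even though both cross block boundaries.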

The second method, called the Shared File Method, reads only the complete items contained in the block; the incomplete items at the start and end of the block are sent to a shared file in the HDFS Distributed Cache. This method works only for relatively small inputs, since the Distributed Cache is limited in size and, in the case of hundreds or thousands of blocks, the shared file can exceed that limit.
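A rough sketch of the per-block scan for this method, again in plain Java with hypothetical names (the real implementation would write the fragments to the Distributed Cache shared file rather than return them): each block yields its fully contained items plus a head fragment (the tail of an item begun in a previous block) and a tail fragment (the start of an item finished in a later block).

```java
import java.util.ArrayList;
import java.util.List;

public class SharedFileSketch {

    // What one block contributes: its complete items, plus the two
    // boundary fragments that would go to the shared file in the
    // Distributed Cache to be stitched together in a later pass.
    static class BlockScan {
        final List<String> completeItems = new ArrayList<>();
        String headFragment = ""; // end of an item begun in a previous block
        String tailFragment = ""; // start of an item ending in a later block
    }

    static BlockScan scanBlock(String block, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        BlockScan result = new BlockScan();
        int firstOpen = block.indexOf(open);
        if (firstOpen < 0) {
            // Whole block is the middle of one item: all head fragment.
            result.headFragment = block;
            return result;
        }
        result.headFragment = block.substring(0, firstOpen);
        int pos = firstOpen;
        while (pos >= 0) {
            int closePos = block.indexOf(close, pos);
            if (closePos < 0) {
                // Item runs past the block end: becomes the tail fragment.
                result.tailFragment = block.substring(pos);
                return result;
            }
            result.completeItems.add(
                    block.substring(pos, closePos + close.length()));
            pos = block.indexOf(open, closePos + close.length());
        }
        return result;
    }
}
```

For a block like "b</item><item>cc</item><item>d", this yields "b</item>" as the head fragment, one complete item, and "<item>d" as the tail fragment, which is exactly the data that would grow the shared file as the number of blocks grows.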

I took the liberty of creating diagrams that show by example what each method does:
[1] One Buffer Method
[2] Shared File Method

Any insight and feedback about these two methods is more than welcome. In my opinion the One Buffer Method is simpler and more effective, since it can be used for both small and large datasets.

There is also a question: can the parser work on data that is missing some tags? For example, the first and last tags of the XML file, which are located in different blocks.

Best regards,
Efi

[1] https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing

[2] https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing



On 05/19/2015 12:43 AM, Michael Carey wrote:
+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:
Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi <efika...@gmail.com> wrote:

Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat Java class with the functionalities that Hamza described in the issue [1]. The class reads from blocks located in HDFS and
returns complete items according to a specified XML tag.
I also tested this class in a standalone Hadoop cluster with XML files of various sizes, the smallest being a single file of 400 MB and the largest a
collection of 5 files totalling 6.1 GB.

This week I will create another implementation of the XMLInputFormat with a different way of reading and delivering files, the way I described in the
same issue, and I will test both solutions in a standalone and a small
Hadoop cluster (5-6 nodes).

You can see this week's results here [2]. I will keep updating this file
with the results of the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
