Hello everyone,
Here is my update on what I did this past week:
I created an XMLInputFormat Java class with the functionality that Hamza
described in the issue [1]. The class reads from blocks located in HDFS
and returns complete items according to a specified XML tag.
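As a rough illustration of what "returning complete items for a tag"
means, here is a minimal sketch in plain Java. It is not the actual
XMLInputFormat code: it reads from any InputStream instead of an HDFS
split, the names XmlItemSketch, readUntilMatch and extractItems are
hypothetical, and it assumes start tags without attributes and no
nesting of the same tag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: scans a byte stream and returns every complete
// <tag>...</tag> item. Not the real XMLInputFormat implementation.
public class XmlItemSketch {

    // Reads bytes until `match` has been seen; returns false at end of input.
    // When `out` is non-null, every byte read is also copied into it.
    // The simple restart logic is sufficient here because '<' occurs only
    // as the first byte of the patterns we search for.
    static boolean readUntilMatch(InputStream in, byte[] match,
                                  ByteArrayOutputStream out) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (out != null) {
                out.write(b);
            }
            if (b == match[matched]) {
                matched++;
                if (matched == match.length) {
                    return true;
                }
            } else {
                matched = (b == match[0]) ? 1 : 0;
            }
        }
        return false;
    }

    // Returns all complete items delimited by <tag> and </tag>.
    // Assumption: start tags carry no attributes and items do not nest.
    static List<String> extractItems(InputStream in, String tag) throws IOException {
        byte[] start = ("<" + tag + ">").getBytes(StandardCharsets.UTF_8);
        byte[] end = ("</" + tag + ">").getBytes(StandardCharsets.UTF_8);
        List<String> items = new ArrayList<>();
        while (readUntilMatch(in, start, null)) {
            ByteArrayOutputStream item = new ByteArrayOutputStream();
            item.write(start, 0, start.length);
            if (!readUntilMatch(in, end, item)) {
                break; // input ended inside an item: drop the incomplete one
            }
            items.add(new String(item.toByteArray(), StandardCharsets.UTF_8));
        }
        return items;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<lib><book>A</book><note>x</note><book>B</book></lib>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String item : extractItems(in, "book")) {
            System.out.println(item);
        }
    }
}
```

In a real input format the same scan would also need to handle items
that span a block boundary, which this sketch leaves out.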
I also tested this class on a standalone Hadoop cluster with XML files
of various sizes, the smallest being a single file of 400 MB and the
largest a collection of 5 files totalling 6.1 GB.
This week I will create another implementation of the XMLInputFormat
that reads and delivers files in a different way, the one I described
in the same issue. I will then test both solutions on a standalone
cluster and on a small Hadoop cluster (5-6 nodes).
You can see this week's results here [2]. I will keep updating this file
with the results of the other tests.
Best regards,
Efi
[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2] https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing