(1) I agree that [1] looks better (thanks for the diagrams - we should add them
to the docs!).
(2) I think that it’s OK to have the restriction that the given tag
(a) identifies the root element of the items that we want to work with
and
(b) is not used recursively (and I would check this condition and fail if
it doesn’t hold - see the sketch below).
If we have a few really big nodes in the file, we don’t have a way to process
them in parallel anyway, so the chosen tags should split the document into a
large number of smaller pieces for VXQuery to work well.
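To make (b) concrete, here is a rough sketch of the kind of check I have in
mind (the class and method names are made up for illustration, this is not
code we have): stream over the elements and fail as soon as an element with
the chosen name is opened while another one with the same name is still open.
The same counter could also live inside the record reader, so the check would
happen during the parsing we do anyway.

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public final class TagNestingCheck {

    // Fails if an element with the chosen name is opened while another
    // element with the same name is still open, i.e. the tag is recursive.
    public static void assertNotRecursive(InputStream in, String tag)
            throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        int open = 0; // elements with this name that are currently open
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(tag)) {
                if (++open > 1) {
                    throw new IllegalArgumentException(
                            "Tag <" + tag + "> is used recursively");
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals(tag)) {
                open--;
            }
        }
    }
}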
Regarding the question of what happens if we start reading a block that does
not contain the tag(s) in question (I think that’s the last open question -
please correct me if I’m wrong): the block would probably be read without
producing any nodes for the query engine to process. So the effort to read it
would be wasted, but I would expect that the block would then be parsed again
as the continuation of another block that contains a start tag. A sketch of
that boundary behavior is below.
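To illustrate what I mean, here is a very rough sketch of how a reader for one
split could behave at block boundaries (again, everything in it is made up for
illustration and is not the actual XMLInputFormat code): it only emits records
whose start tag begins inside its own split, it is allowed to read past the
split end to find the matching closing tag, and a split that contains no start
tag at all simply emits nothing.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public final class SplitReaderSketch {

    // Returns every record whose start tag begins inside this split.
    // 'in' is assumed to be positioned at the first byte of the split and to
    // continue seamlessly into the following blocks, the way an HDFS stream
    // does. Attributes, namespaces and non-ASCII input are ignored here to
    // keep the sketch short.
    public static List<String> readRecords(InputStream in, long splitLength,
                                           String tag) throws IOException {
        String start = "<" + tag + ">";
        String end = "</" + tag + ">";
        List<String> records = new ArrayList<>();

        StringBuilder window = new StringBuilder(); // bytes since the last record
        long consumed = 0;                          // bytes read since the split start
        boolean insideRecord = false;
        int recordStart = -1;

        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            window.append((char) b);

            if (!insideRecord) {
                if (window.length() >= start.length()
                        && window.substring(window.length() - start.length()).equals(start)) {
                    long tagBegin = consumed - start.length();
                    if (tagBegin >= splitLength) {
                        break; // this record belongs to the next split's reader
                    }
                    insideRecord = true;
                    recordStart = window.length() - start.length();
                } else if (consumed >= splitLength + start.length()) {
                    break; // no further record can start inside this split
                }
            } else if (window.length() >= end.length()
                    && window.substring(window.length() - end.length()).equals(end)) {
                // Closing tag found, possibly well past the split end.
                records.add(window.substring(recordStart));
                window.setLength(0);
                insideRecord = false;
            }
        }
        return records; // empty if no start tag began inside this split
    }
}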
Till
> On May 21, 2015, at 2:59 PM, Steven Jacobs <[email protected]> wrote:
>
> This seems correct to me. Since our objective in implementing HDFS support
> is to deal with very large XML files, I think we should avoid any size
> limitations. Regarding the tags, does anyone have any thoughts on this? In
> the case of searching for all elements with a given name regardless of
> depth, this method will work fine, but if we want a specific path, we could
> end up opening lots of blocks to guarantee path correctness - possibly the
> entire file, in fact.
> Steven
>
> On Thu, May 21, 2015 at 10:20 AM, Efi <[email protected]> wrote:
>
>> Hello everyone,
>>
>> For this week, the two different methods for reading complete items
>> according to a specific tag are completed and tested in a standalone HDFS
>> deployment. In detail, what each method does:
>>
>> The first method, which I call the One Buffer Method, reads a block, saves
>> it in a buffer, and continues reading from the following blocks until it
>> finds the specified closing tag. It shows good results and good times in
>> the tests.
>>
>> The second method, called the Shared File Method, reads only the complete
>> items contained in the block; the incomplete items from the start and end
>> of the block are sent to a shared file in the HDFS Distributed Cache.
>> This method can only work for relatively small inputs, since the
>> Distributed Cache is limited and, in the case of hundreds or thousands of
>> blocks, the shared file can exceed that limit.
>>
>> I took the liberty of creating diagrams that show by example what each
>> method does.
>> [1] One Buffer Method
>> [2] Shared File Method
>>
>> Every insight and piece of feedback about these two methods is more than
>> welcome. In my opinion the One Buffer Method is simpler and more effective,
>> since it can be used for both small and large datasets.
>>
>> There is also a question: can the parser work on data that is missing some
>> tags? For example, the first and last tags of the XML file, which are
>> located in different blocks.
>>
>> Best regards,
>> Efi
>>
>> [1]
>> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>>
>> [2]
>> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>>
>>
>>
>>
>> On 05/19/2015 12:43 AM, Michael Carey wrote:
>>
>>> +1 Sounds great!
>>>
>>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>>
>>>> Great work!
>>>> Steven
>>>>
>>>> On Sun, May 17, 2015 at 1:15 PM, Efi <[email protected]> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> This is my update on what I have been doing this last week:
>>>>>
>>>>> Created an XMLInputFormat Java class with the functionality that Hamza
>>>>> described in the issue [1]. The class reads from blocks located in HDFS
>>>>> and returns complete items according to a specified XML tag.
>>>>> I also tested this class on a standalone Hadoop cluster with XML files
>>>>> of various sizes, the smallest being a single file of 400 MB and the
>>>>> largest a collection of 5 files totalling 6.1 GB.
>>>>>
>>>>> This week I will create another implementation of the XMLInputFormat
>>>>> with a different way of reading and delivering files, the way I described
>>>>> in the same issue, and I will test both solutions on a standalone and on
>>>>> a small Hadoop cluster (5-6 nodes).
>>>>>
>>>>> You can see this week's results here [2]. I will keep updating this file
>>>>> with the results of the other tests.
>>>>>
>>>>> Best regards,
>>>>> Efi
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>>> [2]
>>>>>
>>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>>
>>>>>
>>>>>
>>>
>>>
>>