This seems correct to me. Since our objective in supporting HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts? When
searching for all elements with a given name regardless of depth, this
method will work fine, but if we want a specific path, we could end up
opening many blocks to guarantee path correctness, potentially the entire
file.
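
To make the concern concrete, here is a minimal sketch (class and names
are illustrative, not from our code) of why matching a specific path needs
ancestor context: a scan that starts at a block boundary begins with an
empty element stack, so the ancestors of the first items in that block are
unknown without reading the preceding blocks.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative only: track the current element path with a stack
    // while scanning tags. Starting mid-file at a block boundary means
    // the stack is empty, so /library/book/title cannot be told apart
    // from any other path ending in "title".
    public class PathContextDemo {
        public static void main(String[] args) {
            Deque<String> path = new ArrayDeque<>();
            String[] tags = {"<library>", "<book>", "<title>", "</title>"};
            for (String tag : tags) {
                if (tag.startsWith("</")) {
                    path.removeLast();                 // close current element
                } else {
                    path.addLast(tag.replaceAll("[<>/]", ""));
                    System.out.println("path: " + String.join("/", path));
                }
            }
        }
    }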
Steven

On Thu, May 21, 2015 at 10:20 AM, Efi <efika...@gmail.com> wrote:

> Hello everyone,
>
> For this week, the two different methods for reading complete items
> according to a specific tag are completed and tested in a standalone HDFS
> deployment. In detail, what each method does:
>
> The first method, which I call the One Buffer Method, reads a block, saves
> it in a buffer, and continues reading from the following blocks until it
> finds the specific closing tag. It shows good results and good times in the
> tests.
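>
> Roughly, the reading loop looks like this (a minimal sketch with
> illustrative names, not the actual class; it assumes the item delimiter
> is a literal closing tag):
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.io.InputStream;
>
>     class OneBufferSketch {
>         // Buffer bytes and keep reading, across the split boundary if
>         // necessary, until the closing tag completes the current item.
>         static boolean readUntilClosingTag(InputStream in, byte[] closingTag,
>                 ByteArrayOutputStream buffer) throws IOException {
>             int b, matched = 0;
>             while ((b = in.read()) != -1) {
>                 buffer.write(b);
>                 if (b == closingTag[matched]) {
>                     if (++matched == closingTag.length) {
>                         return true;   // complete item is in the buffer
>                     }
>                 } else {
>                     matched = (b == closingTag[0]) ? 1 : 0;  // naive matcher
>                 }
>             }
>             return false;              // EOF before the tag completed
>         }
>     }
>
> Reading past the end of the current split is the point of the method: the
> trailing partial item is completed from the following block.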
>
> The second method, called the Shared File Method, reads only the complete
> items contained in the block; the incomplete items from the start and
> end of the block are sent to a shared file in the HDFS Distributed Cache.
> This method can work only for relatively small inputs, since the
> Distributed Cache is limited and, in the case of hundreds or thousands of
> blocks, the shared file can exceed that limit.
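>
> A sketch of how a single block's contents would split under this method
> (illustrative names; it assumes the block holds at least one complete
> item):
>
>     class SharedFileSketch {
>         // The partial head and tail fragments would go to the shared
>         // file; the middle span of complete items is processed locally.
>         static void splitBlock(String block, String openTag, String closeTag) {
>             int firstOpen = block.indexOf(openTag);
>             int lastCloseEnd = block.lastIndexOf(closeTag) + closeTag.length();
>             String head = block.substring(0, firstOpen);       // partial item
>             String items = block.substring(firstOpen, lastCloseEnd);
>             String tail = block.substring(lastCloseEnd);       // partial item
>             System.out.println("local items: " + items);
>             System.out.println("shared file: " + head + " | " + tail);
>         }
>     }
>
> It is the concatenation of these head and tail fragments over hundreds or
> thousands of blocks that can outgrow the Distributed Cache.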
>
> I took the liberty of creating diagrams that show by example what each
> method does.
> [1] One Buffer Method
> [2] Shared File Method
>
> Any insight and feedback about these two methods is more than welcome. In
> my opinion, the One Buffer Method is simpler and more effective, since it
> can be used for both small and large datasets.
>
> There is also a question: can the parser work on data that is missing
> some tags? For example, the first and last tags of the XML file, which are
> located in different blocks.
>
> Best regards,
> Efi
>
> [1]
> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>
> [2]
> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>
>
> On 05/19/2015 12:43 AM, Michael Carey wrote:
>
>> +1 Sounds great!
>>
>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>
>>> Great work!
>>> Steven
>>>
>>> On Sun, May 17, 2015 at 1:15 PM, Efi <efika...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> This is my update on what I have been doing this last week:
>>>>
>>>> Created an XMLInputFormat Java class with the functionality that Hamza
>>>> described in the issue [1]. The class reads from blocks located in HDFS
>>>> and returns complete items according to a specified XML tag.
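>>>>
>>>> Roughly, a job would wire it up like this (the property name, paths,
>>>> and setup class here are illustrative, not the actual API):
>>>>
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.hadoop.fs.Path;
>>>>     import org.apache.hadoop.mapreduce.Job;
>>>>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>
>>>>     public class XmlJobSetup {
>>>>         public static void main(String[] args) throws Exception {
>>>>             Configuration conf = new Configuration();
>>>>             conf.set("xmlinput.tag", "book");   // tag delimiting items
>>>>             Job job = Job.getInstance(conf, "xml-read");
>>>>             job.setInputFormatClass(XMLInputFormat.class);
>>>>             FileInputFormat.addInputPath(job, new Path("/data/books.xml"));
>>>>             System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>         }
>>>>     }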
>>>> I also tested this class on a standalone Hadoop cluster with XML files of
>>>> various sizes, the smallest being a single file of 400 MB and the largest
>>>> a collection of 5 files totalling 6.1 GB.
>>>>
>>>> This week I will create another implementation of the XMLInputFormat with
>>>> a different way of reading and delivering files, the way I described in
>>>> the same issue, and I will test both solutions on a standalone and a small
>>>> Hadoop cluster (5-6 nodes).
>>>>
>>>> You can see this week's results here [2]. I will keep updating this file
>>>> with the results of the other tests.
>>>>
>>>> Best regards,
>>>> Efi
>>>>
>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>> [2]
>>>>
>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>
>>
>>
>
