This seems correct to me. Since our objective in implementing HDFS support is to deal with very large XML files, I think we should avoid any size limitations. Regarding the tags, does anyone have any thoughts on this? In the case of searching for all elements with a given name regardless of depth, this method will work fine, but if we want a specific path, we could end up opening many blocks to guarantee path correctness, potentially the entire file in fact.
Steven
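To make the path-correctness concern concrete, here is a minimal sketch in plain Java (no Hadoop dependencies; the tag names and the slash-separated path syntax are invented for illustration): matching a full path requires the stack of open ancestor tags, and that stack is only known if scanning began at the first block of the file.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: stream over XML text while maintaining the stack of open ancestor
// tags. Matching a *path* (not just a tag name) needs this stack, which is
// only correct if the scan started at the beginning of the document.
public class PathStackSketch {

    static boolean containsPath(String xml, String path) {
        Deque<String> stack = new ArrayDeque<>();
        // Hypothetical simplified tag matcher; real XML parsing is more involved.
        Matcher m = Pattern.compile("</?([A-Za-z][\\w-]*)[^>]*>").matcher(xml);
        while (m.find()) {
            if (m.group().startsWith("</")) {
                stack.pollLast();              // closing tag: pop ancestor
            } else {
                stack.addLast(m.group(1));     // opening tag: push
                if (String.join("/", stack).equals(path)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String xml = "<root><list><item>x</item></list><item>y</item></root>";
        System.out.println(containsPath(xml, "root/list/item")); // true
        System.out.println(containsPath(xml, "root/item"));      // true
    }
}
```

A reader handed an arbitrary middle block has no such ancestor stack, which is why a path query may force opening every preceding block, as noted above.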
On Thu, May 21, 2015 at 10:20 AM, Efi <efika...@gmail.com> wrote:
> Hello everyone,
>
> For this week, the two different methods for reading complete items
> according to a specific tag are completed and tested in a standalone
> HDFS deployment. In detail, what each method does:
>
> The first method, which I call the One Buffer Method, reads a block,
> saves it in a buffer, and continues reading from the following blocks
> until it finds the specific closing tag. It shows good results and good
> times in the tests.
>
> The second method, called the Shared File Method, reads only the
> complete items contained in the block; the incomplete items from the
> start and end of the block are sent to a shared file in the HDFS
> Distributed Cache. This method can work only for relatively small
> inputs, since the Distributed Cache is limited, and with
> hundreds/thousands of blocks the shared file can exceed that limit.
>
> I took the liberty of creating diagrams that show by example what each
> method does:
> [1] One Buffer Method
> [2] Shared File Method
>
> Any insight and feedback about these two methods is more than welcome.
> In my opinion the One Buffer Method is simpler and more effective, since
> it can be used for both small and large datasets.
>
> There is also a question: can the parser work on data that are missing
> some tags? For example, the first and last tags of the XML file, which
> are located in different blocks.
>
> Best regards,
> Efi
>
> [1]
> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>
> [2]
> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>
> On 05/19/2015 12:43 AM, Michael Carey wrote:
>> +1 Sounds great!
>>
>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>> Great work!
>>> Steven
>>>
>>> On Sun, May 17, 2015 at 1:15 PM, Efi <efika...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> This is my update on what I have been doing this last week:
>>>>
>>>> I created an XMLInputFormat java class with the functionality that
>>>> Hamza described in the issue [1]. The class reads from blocks located
>>>> in HDFS and returns complete items according to a specified XML tag.
>>>> I also tested this class on a standalone Hadoop cluster with XML files
>>>> of various sizes, the smallest being a single file of 400 MB and the
>>>> largest a collection of 5 files totalling 6.1 GB.
>>>>
>>>> This week I will create another implementation of the XMLInputFormat
>>>> with a different way of reading and delivering files, the way I
>>>> described in the same issue, and I will test both solutions on a
>>>> standalone and a small Hadoop cluster (5-6 nodes).
>>>>
>>>> You can see this week's results here [2]. I will keep updating this
>>>> file with the results of the other tests.
>>>>
>>>> Best regards,
>>>> Efi
>>>>
>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>> [2]
>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
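As a rough illustration of the One Buffer Method discussed in this thread, the sketch below (plain Java, no Hadoop dependencies; the method name, tag name, and two-block layout are invented for illustration) buffers a block's bytes and keeps consuming from the next block until the closing tag of the in-progress item appears:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the "One Buffer Method": a reader that owns one
// block but may read past its end into the following block until the closing
// tag of the current item is found. A real Hadoop RecordReader would stream
// from HDFS; here two byte arrays stand in for adjacent blocks.
public class OneBufferSketch {

    // Returns the item that starts in `block` and may end in `nextBlock`,
    // or null if the closing tag is not found in either (a real reader
    // would continue on to further blocks).
    static String readItemAcrossBoundary(byte[] block, byte[] nextBlock, String closeTag) {
        String joined = new String(block, StandardCharsets.UTF_8);
        int end = joined.indexOf(closeTag);
        if (end >= 0) {
            return joined.substring(0, end + closeTag.length());
        }
        // Item is incomplete: buffer it and keep reading from the next block.
        StringBuilder buf = new StringBuilder(joined);
        String next = new String(nextBlock, StandardCharsets.UTF_8);
        int idx = next.indexOf(closeTag);
        if (idx < 0) {
            return null;
        }
        buf.append(next, 0, idx + closeTag.length());
        return buf.toString();
    }

    public static void main(String[] args) {
        byte[] b1 = "<item>first part ".getBytes(StandardCharsets.UTF_8);
        byte[] b2 = "second part</item><item>...".getBytes(StandardCharsets.UTF_8);
        // Prints: <item>first part second part</item>
        System.out.println(readItemAcrossBoundary(b1, b2, "</item>"));
    }
}
```

Under this scheme no shared file is needed: each reader emits the items that begin in its own block, which is why the method has no input-size limitation analogous to the Distributed Cache bound of the Shared File Method.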