On 22 May 2015, at 3:26, Efi wrote:
Thank you for suggesting the recursive-tag check; Steven told me about it
yesterday as well. I hadn't thought of it until now, but I will look for
ways to implement it in both methods so that it does not cause problems.
My question was slightly different: I was wondering whether the query
engine can parse data that contains complete elements but is missing
tags of the enclosing elements.
For example, the data that comes from either of these methods can look
like this:
<books>
....
<book>
...
</book>
And another one like this:
<book>
....
</book>
...
</books>
The query is about data inside the book element. Will these fragments
work with the query engine?
I would hope so. I assume that everything before the first <book> and
between a </book> and the next <book> should be ignored. And everything
between a <book> and a </book> is probably parsed and passed to the
query engine.
Does that make sense?
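A minimal sketch of the behaviour assumed above, in Python for brevity:
everything outside a complete <book> ... </book> pair is skipped, and each
complete element is parsed on its own. The function name
extract_complete_elements is hypothetical and not part of VXQuery; the
sketch assumes the tag is not used recursively and that there are no
self-closing <book/> elements.

```python
import re
import xml.etree.ElementTree as ET


def extract_complete_elements(fragment, tag):
    """Parse every complete <tag>...</tag> element found in a fragment.

    Hypothetical helper for illustration only. Text before the first
    start tag, between items, and after the last end tag is ignored.
    """
    # Start tag (optionally with attributes) up to the nearest end tag.
    # "<books>" does not match when tag == "book", because the character
    # after the name must be whitespace or ">".
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>.*?</{0}>".format(re.escape(tag)), re.DOTALL
    )
    return [ET.fromstring(m.group(0)) for m in pattern.finditer(fragment)]


# A fragment with the opening <books> but no closing </books>.
head = "<books>\n<book><title>A</title></book>\n<book><title>B</title></book>\n"
for book in extract_complete_elements(head, "book"):
    print(book.findtext("title"))
```

Both example fragments from the question would go through this unchanged,
since the unmatched <books> or </books> never reaches the parser.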
About your answer for the scenario where a block does not contain the
tags in question: it can mean two things. Either the block is not part of
the element we want to work with, so we simply ignore it, or it is part
of the element but the starting and ending tags are in the previous/next
blocks, so this block contains only part of the body that we want. In
that case it will be parsed only by the readers that are assigned to
the block that contains the starting tag of this element.
Yes, that sounds right.
On that note, I am currently working on a way to assign only one
reader to each block, because HDFS assigns readers according to the
available CPU cores. That means the same block can be assigned to more
than one reader, and in our case that can lead to memory problems.
I'm not sure I fully understand the current design. Could you explain in
a little more detail which problem you see coming up in which case (I
can imagine a number of problems with memory ...)?
Cheers,
Till
On 22/05/2015 06:53 AM, Till Westmann wrote:
(1) I agree that [1] looks better (thanks for the diagrams - we
should add them to the docs!).
(2) I think that it’s ok to have the restriction that the given tag
(a) identifies the root element of the elements that we want to
work with and
(b) is not used recursively (and I would check this condition and
fail if it doesn’t hold).
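The check in (b) could be sketched as follows, in Python for brevity. This
is only an illustration: check_not_recursive is a hypothetical name, and
the scan is a plain regular expression over the text, so it assumes the
tag name does not appear inside comments or CDATA sections.

```python
import re


def check_not_recursive(xml_text, tag):
    """Raise ValueError if <tag> is opened while another <tag> is open.

    Hypothetical helper for illustration only.
    """
    # Matches <tag>, <tag attr="...">, </tag> and <tag/>, but not <tags>.
    token = re.compile(r"<(/?){0}(?:\s[^>]*?)?(/?)>".format(re.escape(tag)))
    depth = 0
    for m in token.finditer(xml_text):
        is_closing, is_self_closing = m.group(1), m.group(2)
        if is_closing:
            depth = max(depth - 1, 0)
        elif is_self_closing:
            # An empty element nested inside an open one is also recursion.
            if depth > 0:
                raise ValueError("tag <%s> is used recursively" % tag)
        else:
            depth += 1
            if depth > 1:
                raise ValueError("tag <%s> is used recursively" % tag)


# Passes silently: no <book> is nested inside another <book>.
check_not_recursive("<books><book>a</book><book>b</book></books>", "book")
```

Failing early like this keeps the readers simple, since each one can then
treat the nearest closing tag as the end of the current item.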
If we have a few really big nodes in the file, we do not have a way
to process them in parallel anyway, so the chosen tags should split
the document into a large number of smaller pieces for VXQuery to
work well.
With respect to the question of what happens if we start reading a
block that does not contain the tag(s) in question (I think that’s
the last question - please correct me if I’m wrong): it would probably
be read without producing any nodes for the query engine to process.
So the effort to do that would be wasted, but I would expect that the
block would then be parsed again as the continuation of another block
that contained a start tag.
Till
On May 21, 2015, at 2:59 PM, Steven Jacobs <[email protected]> wrote:
This seems correct to me. Since our objective in implementing HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on this? In
the case of searching for all elements with a given name regardless of
depth, this method will work fine, but if we want a specific path, we could
end up opening lots of blocks to guarantee path correctness, the entire
file in fact.
Steven
On Thu, May 21, 2015 at 10:20 AM, Efi <[email protected]> wrote:
Hello everyone,
For this week, the two different methods for reading complete items
according to a specific tag are completed and tested in a standalone HDFS
deployment. In detail, what each method does:
The first method, which I call the One Buffer Method, reads a block, saves
it in a buffer, and continues reading from the following blocks until it
finds a specific closing tag. It shows good results and good times in the
tests.
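The reading loop of the One Buffer Method can be sketched roughly as
follows, in Python for brevity. This is not the XMLInputFormat code:
blocks are modelled as a list of strings, read_items is a hypothetical
name, and the sketch assumes start tags without attributes and a tag that
is not used recursively. It also assumes that an item belongs to the block
in which its start tag begins.

```python
def read_items(blocks, index, tag):
    """Return the items whose start tag begins inside blocks[index]."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    own_len = len(blocks[index])
    buffer = blocks[index]
    next_block = index + 1

    # Look ahead far enough to see a start tag that straddles the
    # boundary between this block and the next one.
    while len(buffer) < own_len + len(open_tag) - 1 and next_block < len(blocks):
        buffer += blocks[next_block]
        next_block += 1

    items, pos = [], 0
    while True:
        start = buffer.find(open_tag, pos)
        if start == -1 or start >= own_len:
            break  # no further item starts in this reader's block
        end = buffer.find(close_tag, start)
        # Keep appending the following blocks until the closing tag shows up.
        while end == -1 and next_block < len(blocks):
            buffer += blocks[next_block]
            next_block += 1
            end = buffer.find(close_tag, start)
        if end == -1:
            break  # unterminated item at the end of the input
        end += len(close_tag)
        items.append(buffer[start:end])
        pos = end
    return items


doc = "<books><book>alpha</book><book>beta</book><book>gamma</book></books>"
blocks = [doc[i:i + 10] for i in range(0, len(doc), 10)]
for i in range(len(blocks)):
    print(i, read_items(blocks, i, "book"))
```

Because every reader only emits items that start in its own block, no item
is returned twice even though several readers may scan the same bytes.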
The second method, called the Shared File Method, reads only the complete
items contained in the block; the incomplete items from the start and
end of the block are sent to a shared file in the HDFS Distributed Cache.
This method can work only for relatively small inputs, since the
Distributed Cache is limited and, in the case of hundreds or thousands of
blocks, the shared file can exceed the limit.
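As a rough illustration of the Shared File Method, in Python for brevity:
each block keeps its complete items and hands the leftovers at its start
and end to a shared store, which is stitched together afterwards. The
names split_block and shared_file_method are hypothetical, a dict stands
in for the shared file in the Distributed Cache, and the sketch assumes
start tags without attributes and a tag that is not used recursively.

```python
import re


def split_block(block, tag):
    """Split a block into (complete items, leading text, trailing text)."""
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    matches = list(pattern.finditer(block))
    if not matches:
        # No complete item: the whole block is an incomplete fragment.
        return [], block, ""
    items = [m.group(0) for m in matches]
    return items, block[:matches[0].start()], block[matches[-1].end():]


def shared_file_method(blocks, tag):
    shared = {}  # stands in for the shared file; grows with the block count
    items = []
    for i, block in enumerate(blocks):
        complete, head, tail = split_block(block, tag)
        items.extend(complete)
        shared[i] = head + tail
    # Stitch the fragments back together in block order and pick up the
    # items that straddled block boundaries.
    leftovers = "".join(shared[i] for i in sorted(shared))
    boundary_items, _, _ = split_block(leftovers, tag)
    return items + boundary_items


doc = "<books><book>alpha</book><book>beta</book><book>gamma</book></books>"
blocks = [doc[i:i + 25] for i in range(0, len(doc), 25)]
print(shared_file_method(blocks, "book"))
```

The shared store receives one entry per block, which is the reason the
approach is limited by the size of the Distributed Cache.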
I took the liberty of creating diagrams that illustrate, with an example,
what each method does.
[1] One Buffer Method
[2] Shared File Method
Any insight and feedback about these two methods is more than welcome. In
my opinion the One Buffer Method is simpler and more effective, since it
can be used for both small and large datasets.
There is also a question: can the parser work on data that is missing
some tags? For example, the first and last tags of the XML file, which
are located in different blocks.
Best regards,
Efi
[1]
https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
[2]
https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
On 05/19/2015 12:43 AM, Michael Carey wrote:
+1 Sounds great!
On 5/18/15 8:33 AM, Steven Jacobs wrote:
Great work!
Steven
On Sun, May 17, 2015 at 1:15 PM, Efi <[email protected]> wrote:
Hello everyone,
This is my update on what I have been doing this last week:
Created an XMLInputFormat Java class with the functionality that Hamza
described in the issue [1]. The class reads from blocks located in HDFS
and returns complete items according to a specified XML tag.
I also tested this class in a standalone Hadoop cluster with XML files of
various sizes, the smallest being a single file of 400 MB and the largest
a collection of 5 files totalling 6.1 GB.
This week I will create another implementation of the XMLInputFormat with
a different way of reading and delivering files, the way I described in
the same issue, and I will test both solutions in a standalone and a
small Hadoop cluster (5-6 nodes).
You can see this week's results here [2]. I will keep updating this file
with the results of the other tests.
Best regards,
Efi
[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing