Re: [#131] Supporting Hadoop data and cluster management

2015-05-24 Thread Till Westmann


On 22 May 2015, at 3:26, Efi wrote:

Thank you for the recursively tag check, Steven told me about it 
yesterday as well.I hadnt thought of it so far but I will think of 
ways to implement it for these methods so it does not create problems.


My question was not exactly that, I was considering if the query 
engine could parse data that have complete elements but miss other 
tags from greater elements.
For example, one data that comes from either of these methods can look 
like this:


books

book
...
/book

And another one like this:

book

/book
...
/books

The query is about data inside the element book, will these work with 
the query engine?


I would hope so. I assume, that everything before the fist book and 
between a /book and the next book should be ignored. And everything 
between a book and a /book is probably parsed and passed to the 
query engine.

Does that make sense?

About your answer for the scenario where a block does not contain the 
tags in question, it can mean two things.It is not part of the element 
we want to work with,so we simply ignore it, or it is part of the 
element but the starting and ending tags are in previous/next blocks. 
So this block contains only part of the body that we want.In that case 
it will be parsed only by the readers that are assigned to read the 
block that contains the starting tag of this element.


Yes, that sounds right.

On that note, I am currently working on a way to assign only one 
reader to each block, because hdfs assigns readers according to the 
available cores of the CPUs you use.That means the same block can be 
assigned to more than one readers and in our case that can lead to 
memory problems.


I'm not sure I fully understand the current design. Could you explain in 
a little more detail in which case you see which problem coming up (I 
can imagine a number of problems with memory ...)?


Cheers,
Till


On 22/05/2015 06:53 πμ, Till Westmann wrote:
(1) I agree that [1] looks better (thanks for the diagrams - we 
should add them to the docs!).
(2) I think that it’s ok to have the restriction, that the given 
tag
  (a) identifies the root element of the elements that we want to 
work with and
  (b) is not used recursively (and I would check this condition and 
fail if it doesn’t hold).


If we have a few really big nodes in the file, we anyway do not have 
a way to process them in parallel, so the chosen tags should split 
the document into a large number of smaller pieces for VXQuery to 
work well.


Wrt. to the question what happens if we start reading a block that 
does not contain the tag(s) in question (I think that that’s the 
last question - please correct me if I’m wrong) it would probably 
be read without producing any nodes that will be processed by the 
query engine. So the effort to do that would be wasted, but I would 
expect that the block would then be parsed again as the continuation 
of another block that contained a start tag.


Till


On May 21, 2015, at 2:59 PM, Steven Jacobs sjaco...@ucr.edu wrote:

This seems correct to me. Since our objective in implementing HDFS 
is to

deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on 
this? In
the case of searching for all elements with a given name regardless 
of
depth, this method will work fine, but if we want a specific path, 
we could
end up opening lots of Blocks to guarantee path correctness, the 
entire

file in fact.
Steven

On Thu, May 21, 2015 at 10:20 AM, Efi efika...@gmail.com wrote:


Hello everyone,

For this week the two different methods for reading complete items
according to a specific tag are completed and tested in standalone 
hdfs

deployment.In detail what each method does:

The first method, I call it One Buffer Method, reads a block, saves 
it in
a buffer, and continues reading from the other blocks until it 
finds a
specific closing tag.It shows good results and good times in the 
tests.


The second method, called Shared File Method, reads only the 
complete
items contained in the block and the incomplete items from the 
start and
end of the block are send to a shared file in the hdfs Distributed 
Cache.
Now this method could work only for relatively small inputs, since 
the
Distributed Cache is limited and in the case of hundreds/thousands 
of

blocks the shared file can exceed the limit.

I took the liberty of creating diagrams that show in example what 
each

method does.
[1] One Buffer Method
[2] Shared File Method

Every insight and feedback is more than welcome about these two 
methods.In
my opinion the One Buffer method is simpler and more effective 
since it can

be used for both small and large datasets.

There is also a question, can the parser work on data that are 
missing
some tags?For example the first and last tag of the xml file that 
are

located in different blocks.

Best regards,
Efi

[1]

Re: [#131]Supporting Hadoop data and cluster management

2015-05-22 Thread Efi
Thank you for the recursively tag check, Steven told me about it 
yesterday as well.I hadnt thought of it so far but I will think of ways 
to implement it for these methods so it does not create problems.


My question was not exactly that, I was considering if the query engine 
could parse data that have complete elements but miss other tags from 
greater elements.
For example, one data that comes from either of these methods can look 
like this:


books

book
...
/book

And another one like this:

book

/book
...
/books

The query is about data inside the element book, will these work with 
the query engine?


About your answer for the scenario where a block does not contain the 
tags in question, it can mean two things.It is not part of the element 
we want to work with,so we simply ignore it, or it is part of the 
element but the starting and ending tags are in previous/next blocks. So 
this block contains only part of the body that we want.In that case it 
will be parsed only by the readers that are assigned to read the block 
that contains the starting tag of this element.


On that note, I am currently working on a way to assign only one reader 
to each block, because hdfs assigns readers according to the available 
cores of the CPUs you use.That means the same block can be assigned to 
more than one readers and in our case that can lead to memory problems.


Efi

On 22/05/2015 06:53 πμ, Till Westmann wrote:

(1) I agree that [1] looks better (thanks for the diagrams - we should add them 
to the docs!).
(2) I think that it’s ok to have the restriction, that the given tag
  (a) identifies the root element of the elements that we want to work with 
and
  (b) is not used recursively (and I would check this condition and fail if 
it doesn’t hold).

If we have a few really big nodes in the file, we anyway do not have a way to 
process them in parallel, so the chosen tags should split the document into a 
large number of smaller pieces for VXQuery to work well.

Wrt. to the question what happens if we start reading a block that does not 
contain the tag(s) in question (I think that that’s the last question - please 
correct me if I’m wrong) it would probably be read without producing any nodes 
that will be processed by the query engine. So the effort to do that would be 
wasted, but I would expect that the block would then be parsed again as the 
continuation of another block that contained a start tag.

Till


On May 21, 2015, at 2:59 PM, Steven Jacobs sjaco...@ucr.edu wrote:

This seems correct to me. Since our objective in implementing HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on this? In
the case of searching for all elements with a given name regardless of
depth, this method will work fine, but if we want a specific path, we could
end up opening lots of Blocks to guarantee path correctness, the entire
file in fact.
Steven

On Thu, May 21, 2015 at 10:20 AM, Efi efika...@gmail.com wrote:


Hello everyone,

For this week the two different methods for reading complete items
according to a specific tag are completed and tested in standalone hdfs
deployment.In detail what each method does:

The first method, I call it One Buffer Method, reads a block, saves it in
a buffer, and continues reading from the other blocks until it finds a
specific closing tag.It shows good results and good times in the tests.

The second method, called Shared File Method, reads only the complete
items contained in the block and the incomplete items from the start and
end of the block are send to a shared file in the hdfs Distributed Cache.
Now this method could work only for relatively small inputs, since the
Distributed Cache is limited and in the case of hundreds/thousands of
blocks the shared file can exceed the limit.

I took the liberty of creating diagrams that show in example what each
method does.
[1] One Buffer Method
[2] Shared File Method

Every insight and feedback is more than welcome about these two methods.In
my opinion the One Buffer method is simpler and more effective since it can
be used for both small and large datasets.

There is also a question, can the parser work on data that are missing
some tags?For example the first and last tag of the xml file that are
located in different blocks.

Best regards,
Efi

[1]
https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing

[2]
https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing




On 05/19/2015 12:43 AM, Michael Carey wrote:


+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:


Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:

Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat java class with the functionalities that Hamza
described in the issue [1] .The class reads from 

Re: [#131]Supporting Hadoop data and cluster management

2015-05-21 Thread Steven Jacobs
This seems correct to me. Since our objective in implementing HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on this? In
the case of searching for all elements with a given name regardless of
depth, this method will work fine, but if we want a specific path, we could
end up opening lots of Blocks to guarantee path correctness, the entire
file in fact.
Steven

On Thu, May 21, 2015 at 10:20 AM, Efi efika...@gmail.com wrote:

 Hello everyone,

 For this week the two different methods for reading complete items
 according to a specific tag are completed and tested in standalone hdfs
 deployment.In detail what each method does:

 The first method, I call it One Buffer Method, reads a block, saves it in
 a buffer, and continues reading from the other blocks until it finds a
 specific closing tag.It shows good results and good times in the tests.

 The second method, called Shared File Method, reads only the complete
 items contained in the block and the incomplete items from the start and
 end of the block are send to a shared file in the hdfs Distributed Cache.
 Now this method could work only for relatively small inputs, since the
 Distributed Cache is limited and in the case of hundreds/thousands of
 blocks the shared file can exceed the limit.

 I took the liberty of creating diagrams that show in example what each
 method does.
 [1] One Buffer Method
 [2] Shared File Method

 Every insight and feedback is more than welcome about these two methods.In
 my opinion the One Buffer method is simpler and more effective since it can
 be used for both small and large datasets.

 There is also a question, can the parser work on data that are missing
 some tags?For example the first and last tag of the xml file that are
 located in different blocks.

 Best regards,
 Efi

 [1]
 https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing

 [2]
 https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing




 On 05/19/2015 12:43 AM, Michael Carey wrote:

 +1 Sounds great!

 On 5/18/15 8:33 AM, Steven Jacobs wrote:

 Great work!
 Steven

 On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:

  Hello everyone,

 This is my update on what I have been doing this last week:

 Created an XMLInputFormat java class with the functionalities that Hamza
 described in the issue [1] .The class reads from blocks located in HDFS
 and
 returns complete items according to a specified xml tag.
 I also tested this class in a standalone hadoop cluster with xml files
 of
 various sizes, the smallest being a single file of 400 MB and the
 largest a
 collection of 5 files totalling 6.1 GB.

 This week I will create another implementation of the XMLInputFormat
 with
 a different way of reading and delivering files, the way I described in
 the
 same issue and I will test both solutions in a standalone and a small
 hadoop cluster (5-6 nodes).

 You can see this week's results here [2] .I will keep updating this file
 about the other tests.

 Best regards,
 Efi

 [1] https://issues.apache.org/jira/browse/VXQUERY-131
 [2]

 https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing








Re: [#131]Supporting Hadoop data and cluster management

2015-05-21 Thread Till Westmann
(1) I agree that [1] looks better (thanks for the diagrams - we should add them 
to the docs!).
(2) I think that it’s ok to have the restriction, that the given tag
 (a) identifies the root element of the elements that we want to work with 
and
 (b) is not used recursively (and I would check this condition and fail if 
it doesn’t hold).

If we have a few really big nodes in the file, we anyway do not have a way to 
process them in parallel, so the chosen tags should split the document into a 
large number of smaller pieces for VXQuery to work well. 

Wrt. to the question what happens if we start reading a block that does not 
contain the tag(s) in question (I think that that’s the last question - please 
correct me if I’m wrong) it would probably be read without producing any nodes 
that will be processed by the query engine. So the effort to do that would be 
wasted, but I would expect that the block would then be parsed again as the 
continuation of another block that contained a start tag. 

Till

 On May 21, 2015, at 2:59 PM, Steven Jacobs sjaco...@ucr.edu wrote:
 
 This seems correct to me. Since our objective in implementing HDFS is to
 deal with very large XML files, I think we should avoid any size
 limitations. Regarding the tags, does anyone have any thoughts on this? In
 the case of searching for all elements with a given name regardless of
 depth, this method will work fine, but if we want a specific path, we could
 end up opening lots of Blocks to guarantee path correctness, the entire
 file in fact.
 Steven
 
 On Thu, May 21, 2015 at 10:20 AM, Efi efika...@gmail.com wrote:
 
 Hello everyone,
 
 For this week the two different methods for reading complete items
 according to a specific tag are completed and tested in standalone hdfs
 deployment.In detail what each method does:
 
 The first method, I call it One Buffer Method, reads a block, saves it in
 a buffer, and continues reading from the other blocks until it finds a
 specific closing tag.It shows good results and good times in the tests.
 
 The second method, called Shared File Method, reads only the complete
 items contained in the block and the incomplete items from the start and
 end of the block are send to a shared file in the hdfs Distributed Cache.
 Now this method could work only for relatively small inputs, since the
 Distributed Cache is limited and in the case of hundreds/thousands of
 blocks the shared file can exceed the limit.
 
 I took the liberty of creating diagrams that show in example what each
 method does.
 [1] One Buffer Method
 [2] Shared File Method
 
 Every insight and feedback is more than welcome about these two methods.In
 my opinion the One Buffer method is simpler and more effective since it can
 be used for both small and large datasets.
 
 There is also a question, can the parser work on data that are missing
 some tags?For example the first and last tag of the xml file that are
 located in different blocks.
 
 Best regards,
 Efi
 
 [1]
 https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
 
 [2]
 https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
 
 
 
 
 On 05/19/2015 12:43 AM, Michael Carey wrote:
 
 +1 Sounds great!
 
 On 5/18/15 8:33 AM, Steven Jacobs wrote:
 
 Great work!
 Steven
 
 On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:
 
 Hello everyone,
 
 This is my update on what I have been doing this last week:
 
 Created an XMLInputFormat java class with the functionalities that Hamza
 described in the issue [1] .The class reads from blocks located in HDFS
 and
 returns complete items according to a specified xml tag.
 I also tested this class in a standalone hadoop cluster with xml files
 of
 various sizes, the smallest being a single file of 400 MB and the
 largest a
 collection of 5 files totalling 6.1 GB.
 
 This week I will create another implementation of the XMLInputFormat
 with
 a different way of reading and delivering files, the way I described in
 the
 same issue and I will test both solutions in a standalone and a small
 hadoop cluster (5-6 nodes).
 
 You can see this week's results here [2] .I will keep updating this file
 about the other tests.
 
 Best regards,
 Efi
 
 [1] https://issues.apache.org/jira/browse/VXQUERY-131
 [2]
 
 https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
 
 
 
 
 
 



Re: [#131]Supporting Hadoop data and cluster management

2015-05-21 Thread Efi

Hello everyone,

For this week the two different methods for reading complete items 
according to a specific tag are completed and tested in standalone hdfs 
deployment.In detail what each method does:


The first method, I call it One Buffer Method, reads a block, saves it 
in a buffer, and continues reading from the other blocks until it finds 
a specific closing tag.It shows good results and good times in the tests.


The second method, called Shared File Method, reads only the complete 
items contained in the block and the incomplete items from the start and 
end of the block are send to a shared file in the hdfs Distributed 
Cache. Now this method could work only for relatively small inputs, 
since the Distributed Cache is limited and in the case of 
hundreds/thousands of blocks the shared file can exceed the limit.


I took the liberty of creating diagrams that show in example what each 
method does.

[1] One Buffer Method
[2] Shared File Method

Every insight and feedback is more than welcome about these two 
methods.In my opinion the One Buffer method is simpler and more 
effective since it can be used for both small and large datasets.


There is also a question, can the parser work on data that are missing 
some tags?For example the first and last tag of the xml file that are 
located in different blocks.


Best regards,
Efi

[1] 
https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing


[2] 
https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing




On 05/19/2015 12:43 AM, Michael Carey wrote:

+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:

Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:


Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat java class with the functionalities that 
Hamza
described in the issue [1] .The class reads from blocks located in 
HDFS and

returns complete items according to a specified xml tag.
I also tested this class in a standalone hadoop cluster with xml 
files of
various sizes, the smallest being a single file of 400 MB and the 
largest a

collection of 5 files totalling 6.1 GB.

This week I will create another implementation of the XMLInputFormat 
with
a different way of reading and delivering files, the way I described 
in the

same issue and I will test both solutions in a standalone and a small
hadoop cluster (5-6 nodes).

You can see this week's results here [2] .I will keep updating this 
file

about the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing 











Re: [#131]Supporting Hadoop data and cluster management

2015-05-18 Thread Steven Jacobs
Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:

 Hello everyone,

 This is my update on what I have been doing this last week:

 Created an XMLInputFormat java class with the functionalities that Hamza
 described in the issue [1] .The class reads from blocks located in HDFS and
 returns complete items according to a specified xml tag.
 I also tested this class in a standalone hadoop cluster with xml files of
 various sizes, the smallest being a single file of 400 MB and the largest a
 collection of 5 files totalling 6.1 GB.

 This week I will create another implementation of the XMLInputFormat with
 a different way of reading and delivering files, the way I described in the
 same issue and I will test both solutions in a standalone and a small
 hadoop cluster (5-6 nodes).

 You can see this week's results here [2] .I will keep updating this file
 about the other tests.

 Best regards,
 Efi

 [1] https://issues.apache.org/jira/browse/VXQUERY-131
 [2]
 https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing




Re: [#131]Supporting Hadoop data and cluster management

2015-05-18 Thread Michael Carey

+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:

Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi efika...@gmail.com wrote:


Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat java class with the functionalities that Hamza
described in the issue [1] .The class reads from blocks located in HDFS and
returns complete items according to a specified xml tag.
I also tested this class in a standalone hadoop cluster with xml files of
various sizes, the smallest being a single file of 400 MB and the largest a
collection of 5 files totalling 6.1 GB.

This week I will create another implementation of the XMLInputFormat with
a different way of reading and delivering files, the way I described in the
same issue and I will test both solutions in a standalone and a small
hadoop cluster (5-6 nodes).

You can see this week's results here [2] .I will keep updating this file
about the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing