If I understand correctly, the problem you did not understand is the one
about the block assignment to mappers.
I believe this is Hadoop functionality: the number of mappers it
assigns for a job is equal to the number of available CPU cores. If the
number of blocks is less than the number of mappers, the same block will
be assigned to more than one mapper for parsing. The problem is that the
mappers on the same machine share the allowed memory, which means the
more mappers, the less memory for each one of them.
For example, on a machine with 4 cores, Hadoop will assign 4 mappers. If
our input has only two blocks, block1 and block2, they will be given to
all mappers for parsing. So mapper1 and mapper2 will both get block1,
and mapper3 and mapper4 will get block2. As a result, the available
memory of this node will be distributed among 4 mappers that will parse
only 2 blocks. I want to make it so that there are only 2 mappers, in
order to get more memory for each one.
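If that is indeed the cause, one knob worth trying (an assumption on my part, I have not verified it against our setup) is the minimum split size, since the number of map tasks follows the number of input splits:

```xml
<!-- mapred-site.xml sketch: raise the minimum split size so that no split
     is smaller than one HDFS block, giving at most one map task per block.
     The 128 MB value below is a hypothetical block size. -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value>
</property>
```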
As an alternative solution, I am also looking at the Hyracks code for
hdfs and hdfs2, in order to use that instead of the MapReduce framework
for reading the blocks.
I hope I answered your question.
Best regards,
Efi
On 25/05/2015 07:55 AM, Till Westmann wrote:
On 22 May 2015, at 3:26, Efi wrote:
Thank you for the recursive tag check; Steven told me about it
yesterday as well. I hadn't thought of it so far, but I will think of
ways to implement it for these methods so it does not create problems.
My question was not exactly that. I was asking whether the query
engine could parse data that have complete elements but are missing
other tags from enclosing elements.
For example, the data that comes from either of these methods can
look like this:
<books>
....
<book>
...
</book>
And another one like this:
<book>
....
</book>
...
</books>
The query is about data inside the element book; will these fragments
work with the query engine?
I would hope so. I assume that everything before the first <book> and
between a </book> and the next <book> should be ignored, and
everything between a <book> and a </book> is probably parsed and
passed to the query engine.
Does that make sense?
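A minimal sketch of that behavior in plain Java (an editorial illustration, not VXQuery code; FragmentScanner and extractRecords are hypothetical names, and it assumes the tag is not used recursively, per the condition discussed in this thread):

```java
import java.util.ArrayList;
import java.util.List;

// Everything before the first <book> and after the last complete </book>
// is ignored; each complete <book>...</book> item is collected as a record
// that would be handed to the query engine.
public class FragmentScanner {
    public static List<String> extractRecords(String fragment, String tag) {
        List<String> records = new ArrayList<>();
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = fragment.indexOf(open, pos);
            if (start < 0) {
                break; // no further start tag in this fragment
            }
            int end = fragment.indexOf(close, start);
            if (end < 0) {
                break; // item continues in the next block; leave it incomplete
            }
            records.add(fragment.substring(start, end + close.length()));
            pos = end + close.length();
        }
        return records;
    }
}
```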
About your answer for the scenario where a block does not contain the
tags in question, it can mean two things. Either it is not part of the
element we want to work with, so we simply ignore it, or it is part of
the element but the starting and ending tags are in the previous/next
blocks, so this block contains only part of the body that we want. In
that case it will be parsed only by the readers that are assigned to
read the block that contains the starting tag of this element.
Yes, that sounds right.
On that note, I am currently working on a way to assign only one
reader to each block, because HDFS assigns readers according to the
available CPU cores. That means the same block can be assigned to
more than one reader, and in our case that can lead to memory
problems.
I'm not sure I fully understand the current design. Could you explain
in a little more detail in which case you see which problem coming up
(I can imagine a number of problems with memory ...)?
Cheers,
Till
On 22/05/2015 06:53 AM, Till Westmann wrote:
(1) I agree that [1] looks better (thanks for the diagrams - we
should add them to the docs!).
(2) I think that it’s ok to have the restriction, that the given tag
(a) identifies the root element of the elements that we want to
work with and
(b) is not used recursively (and I would check this condition and
fail if it doesn’t hold).
If we have a few really big nodes in the file, we do not have a way
to process them in parallel anyway, so the chosen tags should split
the document into a large number of smaller pieces for VXQuery to
work well.
Wrt. the question of what happens if we start reading a block that
does not contain the tag(s) in question (I think that that's the
last question - please correct me if I'm wrong), it would probably be
read without producing any nodes that will be processed by the query
engine. So the effort to do that would be wasted, but I would expect
that the block would then be parsed again as the continuation of
another block that contained a start tag.
Till
On May 21, 2015, at 2:59 PM, Steven Jacobs <sjaco...@ucr.edu> wrote:
This seems correct to me. Since our objective in implementing HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on this?
In the case of searching for all elements with a given name regardless
of depth, this method will work fine, but if we want a specific path, we
could end up opening lots of blocks to guarantee path correctness, the
entire file in fact.
Steven
On Thu, May 21, 2015 at 10:20 AM, Efi <efika...@gmail.com> wrote:
Hello everyone,
For this week, the two different methods for reading complete items
according to a specific tag are completed and tested in a standalone
HDFS deployment. In detail, what each method does:
The first method, which I call the One Buffer Method, reads a block,
saves it in a buffer, and continues reading from the following blocks
until it finds the specific closing tag. It shows good results and good
times in the tests.
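If it helps, here is a rough sketch of that idea in plain Java (OneBufferReader is a hypothetical name; the real implementation would work on HDFS streams and byte buffers rather than chars):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Sketch of the One Buffer Method: read one block's worth of data into a
// buffer, then keep reading past the block boundary until the buffer ends
// with the given closing tag, so the last item in the buffer is complete.
public class OneBufferReader {
    public static String readBlock(InputStream in, int blockSize, String closeTag) {
        StringBuilder buf = new StringBuilder();
        try {
            int b;
            // fill the buffer up to the nominal block size
            while (buf.length() < blockSize && (b = in.read()) != -1) {
                buf.append((char) b);
            }
            // keep reading into the next block(s) until the last item is complete
            while (!endsWith(buf, closeTag) && (b = in.read()) != -1) {
                buf.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    private static boolean endsWith(StringBuilder buf, String s) {
        int off = buf.length() - s.length();
        return off >= 0 && buf.substring(off).equals(s);
    }
}
```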
The second method, called the Shared File Method, reads only the
complete items contained in the block, and the incomplete items from the
start and end of the block are sent to a shared file in the HDFS
Distributed Cache. This method could work only for relatively small
inputs, since the Distributed Cache is limited, and in the case of
hundreds or thousands of blocks the shared file can exceed the limit.
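The split step of this method could be sketched like this (again plain Java with hypothetical names; it assumes the block holds at least one complete item and that the tag is not nested):

```java
// Sketch of the Shared File Method's split step: the incomplete pieces at
// the start and end of a block are the parts that would be appended to the
// shared file; everything in between is complete items parsed locally.
public class SharedFileSplitter {
    // returns { partial item at the block start, partial item at the block end }
    public static String[] edges(String block, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int firstStart = block.indexOf(open);
        int lastClose = block.lastIndexOf(close);
        // assumes firstStart >= 0 and lastClose >= 0 (at least one complete item)
        String head = block.substring(0, firstStart);
        String tail = block.substring(lastClose + close.length());
        return new String[] { head, tail };
    }
}
```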
I took the liberty of creating diagrams that show by example what each
method does.
[1] One Buffer Method
[2] Shared File Method
Any insights and feedback on these two methods are more than welcome.
In my opinion the One Buffer Method is simpler and more effective,
since it can be used for both small and large datasets.
There is also a question: can the parser work on data that are missing
some tags? For example, the first and last tags of the XML file, which
are located in different blocks.
Best regards,
Efi
[1]
https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
[2]
https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
On 05/19/2015 12:43 AM, Michael Carey wrote:
+1 Sounds great!
On 5/18/15 8:33 AM, Steven Jacobs wrote:
Great work!
Steven
On Sun, May 17, 2015 at 1:15 PM, Efi <efika...@gmail.com> wrote:
Hello everyone,
This is my update on what I have been doing this last week:
Created an XMLInputFormat Java class with the functionality that Hamza
described in the issue [1]. The class reads from blocks located in HDFS
and returns complete items according to a specified XML tag.
I also tested this class on a standalone Hadoop cluster with XML files
of various sizes, the smallest being a single file of 400 MB and the
largest a collection of 5 files totalling 6.1 GB.
This week I will create another implementation of the XMLInputFormat
with a different way of reading and delivering files, the way I
described in the same issue, and I will test both solutions on a
standalone and a small Hadoop cluster (5-6 nodes).
You can see this week's results here [2]. I will keep updating this
file with the results of the other tests.
Best regards,
Efi
[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2]
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing