[Supporting Hadoop data and cluster management] weekly update

2015-07-26 Thread Efi

Hello everyone,

This week's update consists of two parts. The first is the
CollectionWithTagRule, which is about reading blocks from HDFS using the
XMLInputFormat class. This rule informs the parser that it needs to read
its data in blocks from HDFS and passes along the additional information
needed to read the items correctly. I also made one change in the
XMLInputFormat class: the class reads a block from HDFS and looks for the
opening and closing tags that the user specified in the query. Until now
I did not take into account that the opening tag may carry more
information regarding the item, for example:

<book name="something">
...
...
</book>

but I was only looking for tags like:
<book>
...
...
</book>

I changed the matching to account for an opening tag that contains
additional attributes, and to include them in the returned item.
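
To make the change concrete, here is a minimal sketch of the kind of
check involved; the method name and the buffer handling are hypothetical
and not the actual XMLInputFormat code:

    // Hypothetical sketch: does the buffer at `pos` start the item's opening
    // tag? Accept both "<book>" and "<book name=...>".
    private static boolean matchesOpeningTag(byte[] buffer, int pos, byte[] tagName) {
        if (buffer[pos] != '<' || pos + tagName.length + 1 >= buffer.length) {
            return false;
        }
        for (int i = 0; i < tagName.length; i++) {
            if (buffer[pos + 1 + i] != tagName[i]) {
                return false;
            }
        }
        // '>' means a plain tag; whitespace means the tag carries attributes
        // that must be kept as part of the returned item.
        byte next = buffer[pos + 1 + tagName.length];
        return next == '>' || next == ' ' || next == '\t' || next == '\n' || next == '\r';
    }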


The second part of the update is about the YARN applications, Slider and
Twill, that I tested this week, and my conclusions about which can be
used better with VXQuery.
   - Slider: Mostly requires configuration files and Python scripts for
the application to work, which I find very good and generic, because
with small changes to the configuration you can reuse the same work in
similar projects.
   - Twill: Requires ZooKeeper to be installed along with YARN in order
to work, and mostly requires changes in the code of the project you want
to use with Twill.


Based on these observations I find Slider, yet again, the better
candidate. Still, if anyone has more experience with either of these
systems, I would like them to give me some feedback on my observations
and, of course, on which one is best.


Thank you,
Efi


Re: [Supporting Hadoop data and cluster management] weekly update

2015-07-04 Thread Efi

Hello everyone,

This week's update is about the changes that I mentioned in my last
update. The JUnit test is not completed yet; I am using a MiniDFSCluster
for the tests, but I haven't managed to get it to work correctly yet. I
believe the problems are trivial and have not reported them in a ticket
so far. I will create a ticket if I continue to receive the same errors.
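
For context, this is roughly how a MiniDFSCluster is started and stopped
around a JUnit test; the class name, file names, and the query step are
illustrative placeholders, not the actual VXQuery test code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class HdfsReadTest {                      // illustrative name
        private MiniDFSCluster cluster;
        private FileSystem fs;

        @Before
        public void setUp() throws Exception {
            Configuration conf = new Configuration();
            // Start an in-process HDFS cluster with a single DataNode.
            cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
            cluster.waitActive();
            fs = cluster.getFileSystem();
        }

        @Test
        public void readsDocumentFromHdfs() throws Exception {
            // Copy a local test document into the temporary cluster.
            fs.copyFromLocalFile(new Path("src/test/resources/books.xml"), new Path("/books.xml"));
            // ... run the query against hdfs://.../books.xml and assert on the result ...
        }

        @After
        public void tearDown() {
            if (cluster != null) {
                cluster.shutdown();
            }
        }
    }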


About the input splits, I have implemented a scheduler that maps each
split to the node that should process it, according to the split's
location and the ratio of splits to nodes. I need to test this as well
before I commit it.
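
A rough sketch of that idea, assuming the standard Hadoop InputSplit API;
the class name and the exact policy (prefer a node that holds the split
locally, otherwise fall back to round-robin under an even per-node cap)
are my reading of the description above, not the committed code:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class SplitScheduler {                    // hypothetical name
        public Map<String, List<InputSplit>> schedule(List<InputSplit> splits, List<String> nodes)
                throws IOException, InterruptedException {
            Map<String, List<InputSplit>> assignment = new HashMap<>();
            for (String node : nodes) {
                assignment.put(node, new ArrayList<InputSplit>());
            }
            // Cap each node's share so splits stay balanced when locality is impossible.
            int maxPerNode = (splits.size() + nodes.size() - 1) / nodes.size();
            int next = 0;
            for (InputSplit split : splits) {
                String chosen = null;
                // Prefer a node that holds a replica of the split's blocks.
                for (String location : split.getLocations()) {
                    if (assignment.containsKey(location) && assignment.get(location).size() < maxPerNode) {
                        chosen = location;
                        break;
                    }
                }
                // Otherwise pick the next node, round-robin, that still has capacity.
                while (chosen == null) {
                    String candidate = nodes.get(next++ % nodes.size());
                    if (assignment.get(candidate).size() < maxPerNode) {
                        chosen = candidate;
                    }
                }
                assignment.get(chosen).add(split);
            }
            return assignment;
        }
    }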

That's all for this week.

Best regards,
Efi

On 25/06/2015 07:30 PM, Efi wrote:
Thank you Eldon, that was very helpful; I had completely overlooked it
when I first set up my Eclipse for VXQuery.


This week I continued working on reading blocks from HDFS. I used some
of the hyracks-hdfs-core classes and methods and was able to get the
splits of the input files from HDFS without having to use a Map
function. I will continue working on how to distribute the splits among
the nodes of the VXQuery cluster and read them correctly.
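
To illustrate the general pattern (this uses the plain Hadoop
input-format API with TextInputFormat as a stand-in; the actual code
goes through hyracks-hdfs-core and the project's XML input format, which
I am not reproducing here):

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitLister {                       // hypothetical name
        public static List<InputSplit> listSplits(String hdfsUri, String inputDir) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", hdfsUri);       // e.g. "hdfs://namenode:8020" (assumed address)
            Job job = Job.getInstance(conf);
            FileInputFormat.addInputPath(job, new Path(inputDir));
            // getSplits() only consults NameNode metadata; no MapReduce job is submitted.
            return new TextInputFormat().getSplits(job);
        }
    }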


I will also make some changes to the JUnit tests for HDFS. They will
start a temporary DFS cluster in order to run the tests, instead of
simply failing when the user does not have an HDFS cluster.


Cheers,
Efi


Re: [Supporting Hadoop data and cluster management] weekly update

2015-06-16 Thread Eldon Carman
Looks good. One quick comment: take a look at our code format and style
guidelines. You can set up Eclipse to format your code for you using our
sister project's code format profile [1].

[1] http://vxquery.apache.org/development_eclipse_setup.html


Re: [Supporting Hadoop data and cluster management] weekly update

2015-06-13 Thread Efi

Hello everyone,

The reading of a single document and of a collection of documents from
HDFS is completed and tested. New JUnit tests have been added in the
xtest project; they are copies of the aggregate tests that I changed
slightly to run against the collection read from HDFS.


I added another option in xtest so that the HDFS tests can run
successfully. It is a boolean option called /hdfs/, and it enables the
tests for HDFS to run.
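
As an illustration of how such a flag can gate the tests (this is a
generic JUnit pattern with hypothetical names, not the actual xtest
wiring):

    import static org.junit.Assume.assumeTrue;
    import org.junit.Before;
    import org.junit.Test;

    public class HdfsCollectionTest {                // illustrative name
        @Before
        public void requireHdfsOption() {
            // Skip (rather than fail) every test in this class unless the
            // hypothetical "hdfs" flag was passed, e.g. with -Dhdfs=true.
            assumeTrue(Boolean.getBoolean("hdfs"));
        }

        @Test
        public void readsCollectionFromHdfs() throws Exception {
            // ... run the aggregate query against the HDFS collection and compare the output ...
        }
    }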


You can view these in the branch /hdfs2_read/ of my GitHub fork of
VXQuery. [1]


I will continue with the parallel reading from HDFS.

Best Regards,
Efi

[1] https://github.com/efikalti/vxquery/tree/hdfs2_read


Re: [Supporting Hadoop data and cluster management] weekly update

2015-06-13 Thread Michael Carey

Very cool!!


Re: [Supporting Hadoop data and cluster management] weekly update

2015-06-04 Thread Eldon Carman
We have a set of JUnit tests to validate VXQuery. I think it would be a
good idea to add test cases that validate the HDFS code you're adding to
the code base. Take a look at the vxquery-xtest sub-project. The VXQuery
Catalog holds all the VXQuery test cases [1]. You could add a new HDFS
test group to this catalog.

1.
https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml


[Supporting Hadoop data and cluster management] weekly update

2015-06-04 Thread Efi

Hello everyone,

This week Preston and Steven helped me with the VXQuery code, and
specifically with where my parser and two more functionalities will fit
in the code.


Along with the HDFS parallel parser that I have been working on these
past weeks, two more methods will be implemented. They will both read
whole files from HDFS, not just blocks. One will read all the files
located in an HDFS directory, and the other will read a single
document.
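
As a rough illustration of the two methods using the plain Hadoop
FileSystem API (class name, URI, and paths are placeholders, not the
actual VXQuery code):

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWholeFileReader {               // hypothetical name
        private final FileSystem fs;

        public HdfsWholeFileReader(String hdfsUri) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", hdfsUri);       // e.g. "hdfs://namenode:8020" (assumed address)
            this.fs = FileSystem.get(conf);
        }

        // Read a single XML document as one stream, to be handed to the parser.
        public InputStream openDocument(String path) throws Exception {
            return fs.open(new Path(path));
        }

        // Open every file found directly under an HDFS directory.
        public List<InputStream> openDirectory(String dir) throws Exception {
            List<InputStream> streams = new ArrayList<>();
            for (FileStatus status : fs.listStatus(new Path(dir))) {
                if (status.isFile()) {
                    streams.add(fs.open(status.getPath()));
                }
            }
            return streams;
        }
    }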


The reading of files from a directory is completed; for the next week I
will focus on testing it and on implementing and testing the second
method, reading a single document.


Best regards,
Efi


[Supporting Hadoop data and cluster management] weekly update

2015-05-28 Thread Efi
This week I studied the VXQuery and Hyracks code in detail, in order to
add my parser to the project.


I will continue working on adding my code to VXQuery and will try to
implement some tests for it as well. I am also looking into ways to use
the Hyracks HDFS code for the HDFS parser.


Thank you,
Efi