[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management

Efi Kaltirimidou (JIRA) Tue, 28 Apr 2015 04:14:23 -0700

    [ 
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516822#comment-14516822
 ]


Efi Kaltirimidou commented on VXQUERY-131:
------------------------------------------

Hi Steven,
   About reading data from HDFS: the problem, as you mentioned, is that the file 
is separated into blocks, so the items at the start and end of each block may be 
incomplete. A simple solution could be to check the first and last lines of 
every block. If the block starts with an opening tag and ends with a closing 
tag, the items at its boundaries are complete. Otherwise the boundary item is 
incomplete, and the rest of it is in the previous or next block of the file. 
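
A minimal sketch of this boundary check, under simplifying assumptions that are 
not part of the proposal itself: every item is a non-nested element written 
exactly as <tagName>...</tagName> with no attributes, an item is smaller than a 
block, and a tag is never cut in half by a block boundary. The class and method 
names (BlockSplitter, Parts, split) are hypothetical and are not VXQuery or 
Hadoop APIs.

```java
/** Hypothetical helper: splits the text of one block into boundary fragments and complete items. */
final class BlockSplitter {

    /** The three parts of a block. */
    static final class Parts {
        final String head;     // end of an item that started in the previous block ("" if none)
        final String complete; // items that lie entirely inside this block
        final String tail;     // start of an item that continues in the next block ("" if none)

        Parts(String head, String complete, String tail) {
            this.head = head;
            this.complete = complete;
            this.tail = tail;
        }
    }

    static Parts split(String block, String tagName) {
        String open = "<" + tagName + ">";
        String close = "</" + tagName + ">";

        int firstOpen = block.indexOf(open);
        int firstClose = block.indexOf(close);
        int lastOpen = block.lastIndexOf(open);
        int lastClose = block.lastIndexOf(close);

        // A closing tag that appears before any opening tag means the block starts mid-item.
        int headEnd = 0;
        if (firstClose >= 0 && (firstOpen < 0 || firstClose < firstOpen)) {
            headEnd = firstClose + close.length();
        }

        // An opening tag that appears after the last closing tag means the block ends mid-item.
        int tailStart = block.length();
        if (lastOpen >= 0 && lastOpen > lastClose) {
            tailStart = lastOpen;
        }

        return new Parts(
                block.substring(0, headEnd),
                block.substring(headEnd, tailStart),
                block.substring(tailStart));
    }
}
```

For example, split("<item>a</item><item>b", "item") yields an empty head, the 
complete part "<item>a</item>", and the tail "<item>b". The query could run on 
the complete part right away, while the head and tail would be forwarded.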
  An incomplete item can be sent to a selected node of the cluster together 
with additional information: the filename and the id of the block it belongs 
to. The query can then run on the rest of the file.
  That way there would be a buffer on the selected node holding the incomplete 
items from every block, and that node would be responsible for assembling the 
items according to the filename and the block id. Once the items are complete, 
the query can run on them as well.
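
The buffering on the selected node could look roughly like the sketch below. It 
assumes that block ids within a file are consecutive integers and that an item 
spans at most two blocks, so the trailing fragment of block i is completed by 
the leading fragment of block i + 1. FragmentAssembler and its methods are 
hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical buffer that pairs up item fragments from adjacent blocks of the same file. */
final class FragmentAssembler {
    // Both maps are keyed by file name plus boundary index; boundary i lies between
    // block i and block i + 1.
    private final Map<String, String> pendingTails = new HashMap<>();
    private final Map<String, String> pendingHeads = new HashMap<>();

    private static String key(String fileName, long boundary) {
        return fileName + '\u0000' + boundary;
    }

    /** Registers the trailing fragment of a block; returns the whole item if its other half is buffered. */
    Optional<String> addTail(String fileName, long blockId, String tail) {
        String k = key(fileName, blockId);
        String head = pendingHeads.remove(k);
        if (head == null) {
            pendingTails.put(k, tail); // wait for the head of block blockId + 1
            return Optional.empty();
        }
        return Optional.of(tail + head);
    }

    /** Registers the leading fragment of a block; returns the whole item if its other half is buffered. */
    Optional<String> addHead(String fileName, long blockId, String head) {
        String k = key(fileName, blockId - 1);
        String tail = pendingTails.remove(k);
        if (tail == null) {
            pendingHeads.put(k, head); // wait for the tail of block blockId - 1
            return Optional.empty();
        }
        return Optional.of(tail + head);
    }
}
```

Fragments may arrive in any order: whichever half arrives second triggers the 
assembly, and the returned item can then be handed to the query.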

Efi

> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
>                 Key: VXQUERY-131
>                 URL: https://issues.apache.org/jira/browse/VXQUERY-131
>             Project: VXQuery
>          Issue Type: Improvement
>            Reporter: Preston Carman
>            Assignee: Preston Carman
>              Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data 
> from this source. The project will include creating a strategy (with the 
> mentor's guidance) for reading XML data from HDFS and implementing it. When 
> connecting VXQuery to HDFS, the strategy may need to consider how to read 
> sections of an XML file. 
> In addition, we could use YARN as our cluster manager. Apache Hadoop YARN 
> (Yet Another Resource Negotiator) would be a good cluster management tool for 
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the 
> cluster with a tool provided by Hadoop? The solution would replace the 
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
