[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management

Preston Carman (JIRA) Wed, 18 Mar 2015 11:23:35 -0700

    [ 
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367609#comment-14367609
 ]


Preston Carman commented on VXQUERY-131:
----------------------------------------

Check out this application template for GSOC: 
http://community.staging.apache.org/gsoc#application-template

VXQuery currently reads data from local files. The system understands how the 
data is partitioned across nodes and creates Hyracks jobs that read local data 
and communicate results across nodes only when the query requires it. HDFS may 
affect Hyracks job creation, or it may be independent, depending on our 
approach. I see a few options on the road to an efficient XQuery on HDFS.

- Can we do a simple query on HDFS? (Start by reading a local file and transfer 
any additional file blocks as necessary to read the whole XML file. Loses 
efficiency when processing files that span multiple blocks.)
- Can we read a partial XML file on HDFS? (Read only the XML stored on the 
local node, but upgrade the parser to read partial XML documents. Loses some 
XQuery properties.)
- Create a new HDFS file loader and reader to better handle the XML document 
properties for processing XQueries.

In each of these cases, I assume that after the data is read in, the VXQuery 
job can handle the rest. The project may result in a basic approach followed 
by an optimized method once we understand the issues better.
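
To make the partial-XML option concrete, here is a minimal sketch in plain 
Java of one way a reader could treat a byte range (standing in for an HDFS 
block) as a split: it returns every record whose start tag begins inside the 
range and reads past the end of the range to finish the last record. This is 
an illustration under stated assumptions, not VXQuery or Hadoop code. The 
class XmlSplitSketch and the method recordsInSplit are hypothetical names, the 
whole file is simulated as an in-memory byte array, and the record tag is 
assumed to have no attributes and no nesting of the same tag.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the class and method names are illustrative only.
class XmlSplitSketch {

    /** Index of the first occurrence of pattern in data at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the <tag>...</tag> records whose start tag begins in [start, end).
     * A record that begins before end is read to completion even if it runs
     * past end, which models fetching bytes from the next block.
     * Assumes the tag has no attributes and is not nested inside itself.
     */
    static List<String> recordsInSplit(byte[] data, int start, int end, String tag) {
        byte[] open = ("<" + tag + ">").getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + tag + ">").getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            int s = indexOf(data, open, pos);
            if (s < 0 || s >= end) {
                break; // no more start tags, or the next one belongs to a later split
            }
            int e = indexOf(data, close, s + open.length);
            if (e < 0) {
                break; // truncated record: no closing tag in the file
            }
            int stop = e + close.length;
            records.add(new String(data, s, stop - s, StandardCharsets.UTF_8));
            pos = stop;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<root><rec>a</rec><rec>bb</rec><rec>ccc</rec></root>"
                .getBytes(StandardCharsets.UTF_8);
        int blockSize = 20; // made-up block size for the demonstration
        for (int start = 0; start < data.length; start += blockSize) {
            int end = Math.min(start + blockSize, data.length);
            System.out.println("[" + start + ", " + end + "): "
                    + recordsInSplit(data, start, end, "rec"));
        }
    }
}
```

Because each record is assigned to the split that contains the first byte of 
its start tag, every record is produced exactly once across all splits. The 
records are cut off from their ancestors (the root element here), which is 
one example of the XQuery properties the partial-XML option gives up.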

> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
>                 Key: VXQUERY-131
>                 URL: https://issues.apache.org/jira/browse/VXQUERY-131
>             Project: VXQuery
>          Issue Type: Improvement
>            Reporter: Preston Carman
>            Assignee: Preston Carman
>              Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data 
> from this source. The project will include creating a strategy (with the 
> mentor's guidance) for reading XML data from HDFS and implementing it. When 
> connecting VXQuery to HDFS, the strategy may need to consider how to read 
> sections of an XML file. 
> In addition, we could use YARN as our cluster manager. Apache Hadoop YARN 
> (Yet Another Resource Negotiator) would be a good cluster management tool for 
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the 
> cluster with a tool provided by Hadoop? The solution would replace the 
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
