[
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350805#comment-14350805
]
Preston Carman commented on VXQUERY-131:
----------------------------------------
The high-level goal is to execute an XQuery statement on a cluster accessing
HDFS data (or local files). To accomplish this task, we need to do the following:
- Deploy VXQuery.
- Start the cluster controller.
- Start all the data nodes.
- Execute the query. (The query includes parsing XML and producing a result.)
- Stop the data nodes.
- Stop the cluster controller.
The steps above could run as a single YARN job.
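
As a rough sketch of how the steps above might be sequenced inside one job,
assuming a hypothetical cluster object with one method per step (none of these
names come from VXQuery or the existing management scripts), the stop steps can
be registered as cleanup callbacks so they still run if the query fails:

```python
from contextlib import ExitStack


class RecordingCluster:
    """Hypothetical stand-in that only records the order of lifecycle calls."""

    def __init__(self):
        self.calls = []

    def deploy(self):
        self.calls.append("deploy")

    def start_cluster_controller(self):
        self.calls.append("start cc")

    def start_data_nodes(self):
        self.calls.append("start nodes")

    def execute(self, query):
        self.calls.append("execute")
        return "result of " + query

    def stop_data_nodes(self):
        self.calls.append("stop nodes")

    def stop_cluster_controller(self):
        self.calls.append("stop cc")


def run_query_job(cluster, query):
    """Run one query through the whole lifecycle listed above."""
    cluster.deploy()
    with ExitStack() as stack:
        # Callbacks run in reverse order of registration, so the data nodes
        # are stopped before the cluster controller, even on failure.
        cluster.start_cluster_controller()
        stack.callback(cluster.stop_cluster_controller)
        cluster.start_data_nodes()
        stack.callback(cluster.stop_data_nodes)
        return cluster.execute(query)
```

Calling run_query_job(RecordingCluster(), "...") records the six steps in the
order listed above. A real job would replace the stand-in with calls into the
deployment and the command line interface.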
When we consider YARN and HDFS, we can start looking at how to optimize this
process for parallel execution. Does the query transfer files over the network
or access local data? How can we most efficiently read XML data from HDFS? Does
reading XML maintain XQuery's document order? Does the parser need to be updated
to operate on partial XML data? Does VXQuery need its own custom input format?
These are just a few of my thoughts on issues to consider for this project.
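
To make the partial-XML and document-order questions concrete, here is a
minimal sketch of one possible strategy, not something VXQuery implements: each
byte-range split emits only the records whose start tag begins inside it, and
reads past its end to finish the last one. The function name and the in-memory
bytes are assumptions for illustration. A real reader would stream from HDFS,
and the sketch assumes records are not nested and the start tag carries no
attributes.

```python
def records_in_split(data, start, end, tag):
    """Return complete <tag>...</tag> records whose start tag begins in [start, end).

    A record that begins inside the split is read through its end tag even if
    that crosses `end`. A record that began before `start` is left to the
    split that owns its start tag.
    """
    open_tag = b"<" + tag + b">"
    close_tag = b"</" + tag + b">"
    records = []
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        if begin == -1 or begin >= end:
            break  # no further record starts inside this split
        finish = data.find(close_tag, begin)
        if finish == -1:
            break  # truncated record: nothing complete to emit
        finish += len(close_tag)
        records.append(data[begin:finish])
        pos = finish
    return records


data = b"<root><r>1</r><r>2</r><r>3</r></root>"
splits = [(0, 12), (12, 24), (24, len(data))]
per_split = [records_in_split(data, s, e, b"r") for s, e in splits]
```

Concatenating per_split in split order yields every record exactly once and in
document order, although the second boundary (byte 24) falls in the middle of
the third record.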
Currently available:
- The Python cluster management scripts can deploy, start, and stop a cluster.
- A command line interface to execute a query on the cluster.
- An XML parser for local files (must be a complete XML file).
FYI, the cluster management scripts are in Python, but VXQuery is written in
Java.
> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
> Key: VXQUERY-131
> URL: https://issues.apache.org/jira/browse/VXQUERY-131
> Project: VXQuery
> Issue Type: Improvement
> Reporter: Preston Carman
> Assignee: Preston Carman
> Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data
> from this source. The project will include creating a strategy (with the
> mentor's guidance) for reading XML data from HDFS and implementing it. When
> connecting VXQuery to HDFS, the strategy may need to consider how to read
> sections of an XML file.
> In addition, we could use Yarn as our cluster manager. The Apache Hadoop YARN
> (Yet Another Resource Negotiator) would be a good cluster management tool for
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the
> cluster with a tool provided by Hadoop? The solution would replace the
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN