[ 
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505258#comment-14505258
 ] 

Steven Jacobs commented on VXQUERY-131:
---------------------------------------

One of the biggest issues for the later part of this project will be reading 
large XML files from HDFS. HDFS stores a file as a sequence of blocks spread 
across nodes, and because XML elements nest, a block cannot in general be 
parsed on its own. For example, if you opened only the second block of an XML 
file, you would not know which open elements enclose the first line of that 
block. This is still somewhat of an open problem. The simplest solution is to 
read the entire file as one unit, but that gives up some of the benefits of 
HDFS, such as parallel reads of separate blocks. I would recommend reading 
through some of the existing solutions and questions about this problem.
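
To make the problem concrete, here is a minimal sketch in Python (not VXQuery 
code). It assumes a made-up document of flat <station> records, a fixed block 
size chosen by the caller, and records that carry no attributes and do not 
nest inside each other. It shows that a block cut out of the middle of the 
file does not parse on its own, and illustrates one commonly used workaround: 
each reader emits only the records whose opening tag starts inside its own 
byte range, and reads past the end of the range to finish the last one. The 
names DOC, parses_alone and read_records are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names are illustrative only.
DOC = (
    b"<stations>"
    b"<station><id>1</id><name>Alpha</name></station>"
    b"<station><id>2</id><name>Beta</name></station>"
    b"<station><id>3</id><name>Gamma</name></station>"
    b"</stations>"
)


def parses_alone(chunk):
    """Return True if the chunk is a well-formed XML document by itself."""
    try:
        ET.fromstring(chunk)
        return True
    except ET.ParseError:
        return False


def read_records(data, start, end, open_tag, close_tag):
    """Return the records whose opening tag begins in [start, end).

    A record that begins inside the range is read through to its closing
    tag even when that tag lies beyond `end`, so every record is emitted
    by exactly one range. Assumes exact tag matches (no attributes) and
    no nesting of records with the same name.
    """
    records = []
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        if begin == -1 or begin >= end:
            break  # the next record belongs to a later range, or none is left
        finish = data.find(close_tag, begin)
        if finish == -1:
            break  # truncated record; a real reader would report this
        finish += len(close_tag)
        records.append(data[begin:finish])
        pos = finish
    return records


def split_ranges(length, block_size):
    """Cut [0, length) into consecutive ranges of at most block_size bytes."""
    return [(s, min(s + block_size, length)) for s in range(0, length, block_size)]


if __name__ == "__main__":
    for start, end in split_ranges(len(DOC), 64):
        recs = read_records(DOC, start, end, b"<station>", b"</station>")
        print(start, end, parses_alone(DOC[start:end]), len(recs))
```

Note that this only recovers whole records. It does not recover the ancestor 
elements of a record, which is the harder part of the problem described above.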

For your second question, I think the best place to start is the current 
options for setting up a cluster in VXQuery, along with the lifecycle of the 
cluster and of individual jobs. That should give you a working knowledge of 
how a VXQuery cluster operates and how it could relate to YARN.

Steven



> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
>                 Key: VXQUERY-131
>                 URL: https://issues.apache.org/jira/browse/VXQUERY-131
>             Project: VXQuery
>          Issue Type: Improvement
>            Reporter: Preston Carman
>            Assignee: Preston Carman
>              Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data 
> from this source. The project will include creating a strategy (with the 
> mentor's guidance) for reading XML data from HDFS and implementing it. When 
> connecting VXQuery to HDFS, the strategy may need to consider how to read 
> sections of an XML file. 
> In addition, we could use YARN as our cluster manager. Apache Hadoop YARN 
> (Yet Another Resource Negotiator) would be a good cluster management tool for 
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the 
> cluster with a tool provided by Hadoop? The solution would replace the 
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
