[ 
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367290#comment-14367290
 ] 

Hamza Zafar commented on VXQUERY-131:
-------------------------------------

Since HDFS stores files in fixed-size blocks (64 MB by default in Hadoop 1.x, 
128 MB in Hadoop 2.x), XML files will be divided into several blocks. The 
blocks will be distributed to datanodes on different machines. Processing the 
chunks of an XML file in parallel requires launching the VXQuery containers on 
the nodes where the blocks of the XML file reside, so the queries will work on 
blocks in local storage. How do you plan to aggregate the results? Will there 
be a VXQuery reducer process that receives the results from the other VXQuery 
containers (which processed the local XML blocks)? If there is a VXQuery 
reducer, what would be the communication mechanism between the VXQuery 
containers and the VXQuery reducer?
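
To make the block layout concrete, here is a minimal sketch of how a file's 
byte range falls into fixed-size blocks. It is only an illustration of the 
arithmetic, not VXQuery or Hadoop code: the class name BlockSplitSketch and 
the method splitIntoBlocks are hypothetical, and the 200 MB file size is an 
assumed example value. The 64 MB block size is the figure quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is not part of VXQuery or the Hadoop API.
public class BlockSplitSketch {

    /**
     * Returns the half-open byte ranges [start, end) that a file of the given
     * size occupies when it is cut into fixed-size blocks. Each entry is a
     * two-element array: {start, end}.
     */
    public static List<long[]> splitIntoBlocks(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<long[]> blocks = new ArrayList<>();
        for (long start = 0; start < fileSize; start += blockSize) {
            // The last block may be shorter than blockSize.
            long end = Math.min(start + blockSize, fileSize);
            blocks.add(new long[] {start, end});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB, the figure quoted in the comment
        long fileSize = 200L * 1024 * 1024;  // assumed example: a 200 MB XML file
        for (long[] block : splitIntoBlocks(fileSize, blockSize)) {
            System.out.println("block bytes [" + block[0] + ", " + block[1] + ")");
        }
    }
}
```

The cut points are plain byte offsets, so an XML element can start in one 
block and end in the next. Any per-block reader would have to handle elements 
that straddle a boundary before the partial results can be aggregated.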

> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
>                 Key: VXQUERY-131
>                 URL: https://issues.apache.org/jira/browse/VXQUERY-131
>             Project: VXQuery
>          Issue Type: Improvement
>            Reporter: Preston Carman
>            Assignee: Preston Carman
>              Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data 
> from this source. The project will include creating a strategy (with the 
> mentor's guidance) for reading XML data from HDFS and implementing it. When 
> connecting VXQuery to HDFS, the strategy may need to consider how to read 
> sections of an XML file. 
> In addition, we could use YARN as our cluster manager. Apache Hadoop YARN 
> (Yet Another Resource Negotiator) would be a good cluster management tool for 
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the 
> cluster with a tool provided by Hadoop? The solution would replace the 
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
