[
https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hamza Zafar updated VXQUERY-131:
--------------------------------
Comment: was deleted
(was: Hey Efi,
Congrats on getting selected! While I was researching this problem, I thought
of writing XMLInputFormat and XMLRecordReader classes to handle incomplete
XML documents. I came across a nice implementation for handling XML files in
HDFS by Apache Mahout[1]. The general idea is to check whether the XML record
is incomplete and, if so, read in the next block (which might reside on a
different node) until you find the closing tag.
[1]https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java)
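
The idea in the comment above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the Mahout code and not a VXQuery
design: the class XmlSplitReader and its methods are hypothetical names, a
plain byte array stands in for the HDFS file, and reading past splitEnd stands
in for fetching the next block. It also assumes the record's start tag has no
attributes and that records are not nested.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a record belongs to the split in which its start tag
// begins; the reader may run past the end of the split to find the end tag.
class XmlSplitReader {

    // Index of the first match of pattern that starts in [from, limit), or -1.
    static int indexOf(byte[] data, byte[] pattern, int from, int limit) {
        for (int i = from; i < limit && i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns every record whose start tag begins inside [splitStart, splitEnd).
    static List<String> readRecords(byte[] data, int splitStart, int splitEnd,
                                    String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            // The start tag must begin inside this split.
            int begin = indexOf(data, open, pos, splitEnd);
            if (begin < 0) {
                break;
            }
            // The end tag may lie beyond splitEnd (i.e. in the next block).
            int end = indexOf(data, close, begin + open.length, data.length);
            if (end < 0) {
                break; // record is truncated; no closing tag in the file
            }
            int recordEnd = end + close.length;
            records.add(new String(data, begin, recordEnd - begin,
                                   StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }
}
```

With this ownership rule, a record that straddles a split boundary is emitted
only by the split that contains its start tag, so running one reader per split
should yield each record exactly once.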
> Supporting Hadoop data and cluster management
> ---------------------------------------------
>
> Key: VXQUERY-131
> URL: https://issues.apache.org/jira/browse/VXQUERY-131
> Project: VXQuery
> Issue Type: Improvement
> Reporter: Preston Carman
> Assignee: Preston Carman
> Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data
> from this source. The project will include creating a strategy (with the
> mentor's guidance) for reading XML data from HDFS and implementing it. When
> connecting VXQuery to HDFS, the strategy may need to consider how to read
> sections of an XML file.
> In addition, we could use YARN as our cluster manager. Apache Hadoop YARN
> (Yet Another Resource Negotiator) would be a good cluster management tool for
> VXQuery. If VXQuery can read data from HDFS, then why not also manage the
> cluster with a tool provided by Hadoop? The solution would replace the
> current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage the VXQuery cluster with YARN
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)