GitHub user shubham-pathak22 opened a pull request:
https://github.com/apache/incubator-apex-malhar/pull/219
APEXMALHAR-2014 added parquet reader
Adds **ParquetReaderOperator**, which allows Apex users to read
Parquet files.
**Apache Parquet** is a columnar storage format available to any project in
the Hadoop ecosystem, regardless of the choice of data processing framework,
data model or programming language.
For more information: [Apache Parquet](https://parquet.apache.org/documentation/latest/ "Apache Parquet")
#### Implementation Details
* **AbstractParquetFileReaderOperator** extends
**AbstractFileInputOperator** and overrides the *openFile()* and
*readEntity()* methods.
* *openFile()* instantiates a *ParquetReader* (the reader provided by the
parquet-mr project, which reads Parquet records from a file) with
*GroupReadSupport*, so records are read as *Group* objects.
* *readEntity()* reads the next record and calls *convertGroup()*.
Derived classes override *convertGroup()* to convert the *Group* into
whatever form downstream operators require.
* **ParquetFilePOJOReader** is provided as a concrete implementation of
**AbstractParquetFileReader** that reads Parquet files and emits records
as POJOs. The POJO class name and field mapping should be provided by the
user. If the mapping is not provided, reflection is used to determine it.
Currently only primitive types (INT32, INT64, BOOLEAN, FLOAT, DOUBLE,
BINARY) are supported.
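
The reader design described above can be illustrated with a minimal, dependency-free sketch. This is not the code in this pull request: `FakeGroup`, `SketchReader`, `SketchPojoReader` and `Event` are hypothetical stand-ins invented for illustration, with `FakeGroup` playing the role of parquet-mr's *Group* and a plain list playing the role of the opened file. It only shows the shape of the pattern, where *readEntity()* delegates to an overridable *convertGroup()*, and where field names are matched by reflection when no explicit mapping is given.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Parquet Group: field name -> value.
final class FakeGroup {
    private final Map<String, Object> values = new LinkedHashMap<>();

    FakeGroup with(String name, Object value) {
        values.put(name, value);
        return this;
    }

    boolean has(String name) {
        return values.containsKey(name);
    }

    Object get(String name) {
        return values.get(name);
    }
}

// Hypothetical base class: readEntity() pulls the next record and
// hands it to convertGroup(), which subclasses must implement.
abstract class SketchReader<T> {
    private final Iterator<FakeGroup> records;

    SketchReader(List<FakeGroup> records) {
        this.records = records.iterator();
    }

    // Returns the next converted record, or null when no records remain.
    T readEntity() {
        return records.hasNext() ? convertGroup(records.next()) : null;
    }

    protected abstract T convertGroup(FakeGroup group);
}

// Hypothetical POJO-emitting subclass: with no explicit mapping, each
// non-static POJO field is filled from the group field of the same name.
final class SketchPojoReader<T> extends SketchReader<T> {
    private final Class<T> pojoClass;

    SketchPojoReader(Class<T> pojoClass, List<FakeGroup> records) {
        super(records);
        this.pojoClass = pojoClass;
    }

    @Override
    protected T convertGroup(FakeGroup group) {
        try {
            Constructor<T> ctor = pojoClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            T pojo = ctor.newInstance();
            for (Field field : pojoClass.getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                    continue;
                }
                if (group.has(field.getName())) {
                    field.setAccessible(true);
                    // Field.set unwraps boxed values for primitive fields.
                    field.set(pojo, group.get(field.getName()));
                }
            }
            return pojo;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + pojoClass.getName(), e);
        }
    }
}

// Hypothetical POJO with only primitive-like fields.
final class Event {
    long id;
    String name;
    boolean active;
}
```

A reader built as `new SketchPojoReader<>(Event.class, records)` would emit one `Event` per `FakeGroup`. In the real operator the equivalent conversion would work against parquet-mr's *Group* and its schema rather than a map.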
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/shubham-pathak22/incubator-apex-malhar APEXMALHAR-2014
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-apex-malhar/pull/219.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #219
----
commit d5c9415959f583caf68292cfb3cfac262d49eb99
Author: shubham <[email protected]>
Date: 2016-03-21T10:20:21Z
APEXMALHAR-2014 added parquet reader
----