Hi Wes,
Thanks,
[Part 1]
C++ HDFS/ORC [Completed]

Steps I followed:
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader
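In code, the three steps look roughly like this (a simplified sketch, not the exact code I ran: it assumes an Arrow ~5.x build with ARROW_HDFS and ARROW_ORC enabled, `ReadOrcFromHdfs` is just a name for the helper, and I've used the generic arrow::fs::FileSystemFromUri, which also splits off the in-filesystem path, as an alternative to HdfsOptions::FromUri + HadoopFileSystem::Make):

```cpp
#include <memory>
#include <string>

#include <arrow/adapters/orc/adapter.h>
#include <arrow/filesystem/api.h>
#include <arrow/io/interfaces.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadOrcFromHdfs(const std::string& uri) {
  // 1) Create the Hadoop filesystem from the hdfs:// URI.
  //    FileSystemFromUri also returns the path component within the filesystem.
  std::string path;
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::FileSystemFromUri(uri, &path));

  // 2) Open a RandomAccessFile stream for that path.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::io::RandomAccessFile> input,
                        fs->OpenInputFile(path));

  // 3) Hand the stream to the ORC adapter and read the whole file into a Table.
  //    (Arrow 5.x ORCFileReader::Open takes an out-parameter; later versions
  //    return a Result instead.)
  std::unique_ptr<arrow::adapters::orc::ORCFileReader> reader;
  ARROW_RETURN_NOT_OK(arrow::adapters::orc::ORCFileReader::Open(
      input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->Read(&table));
  return table;
}
```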
[Part 2]
C++ HDFS/ORC via Java JNI [Partially completed]

Following the same approach in the ORC JNI wrapper:
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader
<jni snippet>

std::unique_ptr<ORCFileReader> reader;
arrow::Status ret;
if (path.find("hdfs://") == 0) {
  arrow::fs::HdfsOptions options_;
  options_ = *arrow::fs::HdfsOptions::FromUri(path);
  auto _fsRes = arrow::fs::HadoopFileSystem::Make(options_);
  if (!_fsRes.ok()) {
    std::cerr << "HadoopFileSystem::Make failed, it is possible when we don't have "
                 "proper driver on this node, err msg is "
              << _fsRes.status().ToString();
  }
  _fs = *_fsRes;
  auto _stream = *_fs->OpenInputFile(path);
  // global holder in arrow::jni::ConcurrentMap, cleared during unload
  hadoop_fs_holder_.Insert(_fs);
  ret = ORCFileReader::Open(_stream, arrow::default_memory_pool(), &reader);
  if (!ret.ok()) {
    env->ThrowNew(io_exception_class,
                  std::string("Failed to open file " + path).c_str());
  }
  return orc_reader_holder_.Insert(std::shared_ptr<ORCFileReader>(reader.release()));
}
The JNI path also works fine, but at the end of the application I get a
segmentation fault. Do you have any idea? It looks like some issue with the
libhdfs connection close or cleanup.

Stack trace:
/tmp/tmp3973555041947319188libarrow_orc_jni.so : ()+0xb8b1a3
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb
/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x90e769
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0xad3803
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: JVM_handle_linux_signal()+0x1a5
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x90b8b8
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x8cac27
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x8cc50b
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0xada661
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x6c237d
/home/legion/ha_devel/hadoop-ecosystem-3x/hadoop-3.1.1/lib/native/libhdfs.so
: ()+0xaa4f
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x85a1
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x962a
/lib/x86_64-linux-gnu/libc.so.6 : clone()+0x43
On Wed, 8 Sept 2021 at 04:07, Weston Pace <[email protected]> wrote:
> I'll just add that a PR is in progress (thanks Joris!) for adding this
> adapter: https://github.com/apache/arrow/pull/10991
>
> On Tue, Sep 7, 2021 at 12:05 PM Wes McKinney <[email protected]> wrote:
> >
> > I'm missing context but if you're talking about C++/Python, we are
> > currently missing a wrapper interface to the ORC reader in the Arrow
> > datasets library
> >
> > https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset
> >
> > We have CSV, Arrow (IPC), and Parquet interfaces.
> >
> > But we have an HDFS filesystem implementation and an ORC reader
> > implementation, so mechanically all of the pieces are there but need
> > to be connected together.
> >
> > Thanks,
> > Wes
> >
> > On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar <[email protected]> wrote:
> > >
> > > Hi Dev-Community,
> > >
> > > Anyone can help me to guide how to read ORC directly from HDFS to an
> > > arrow dataset.
> > >
> > > Thanks
> > > Manoj
>