With open_stream I get a different error:

> python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
    return RecordBatchStreamReader(source)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
    self._open(source)
  File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but only read 4
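As an aside, that number looks decodable: 1886221359 is 0x706D742F, which is the ASCII text "/tmp" read as a little-endian 32-bit integer, so the four bytes the stream reader interpreted as a metadata length appear to be literal path text rather than IPC framing. And since the earlier open_file error reported "File is too small: 8" while ls showed 720 bytes, and "/tmp/foo" is exactly 8 characters, it may be worth double-checking that the C++ program and the Python interpreter are looking at the same /tmp/foo. Below is a minimal, purely illustrative sketch of that decoding (not part of the original program; assumes a little-endian machine):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // The IPC stream reader treats the first four bytes it encounters (when
    // they are not the 0xFFFFFFFF continuation marker) as a little-endian
    // int32 metadata length. Decode the reported value back into raw bytes
    // to see what the reader actually read.
    const int32_t reported = 1886221359;  // value from the ArrowInvalid message above
    char bytes[5] = {0};
    std::memcpy(bytes, &reported, sizeof(reported));  // little-endian assumed
    std::printf("0x%08X -> \"%s\"\n", static_cast<unsigned>(reported), bytes);
    // Prints: 0x706D742F -> "/tmp"
    return 0;
}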
On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com> wrote:
> >
> > I was able to reproduce my issue in a small, fully-contained program.
> > Here is the source code:
> >
> > #include <arrow/builder.h>
> > #include <arrow/io/file.h>
> > #include <arrow/ipc/writer.h>
> > #include <arrow/record_batch.h>
> >
> > arrow::Status foo() {
> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> >
> >     std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> >     arrowFields[0] = arrow::field("foo", arrow::int64());
> >     arrowFields[1] = arrow::field("bar", arrow::int64());
> >     std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
> >
> >     std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> >     arrow::Int64Builder arrowBuilder;
> >     for (int i = 0; i < 2; i++) {
> >         for (int j = 0; j < 21; j++)
> >             if (i && (j % 2))
> >                 arrowBuilder.AppendNull();
> >             else
> >                 arrowBuilder.Append(j);
> >         ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> >     }
> >     arrowBatch = arrow::RecordBatch::Make(arrowSchema,
> >                                           arrowArrays[0]->length(), arrowArrays);
> >
> >     ARROW_ASSIGN_OR_RAISE(arrowStream,
> >                           arrow::io::FileOutputStream::Open("/tmp/foo"));
> >     ARROW_ASSIGN_OR_RAISE(arrowWriter,
> >                           arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
> >     ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
> >
> >     return arrow::Status::OK();
> > }
> >
> > int main() {
> >     foo();
> > }
> >
> > I compile and run it like this:
> >
> > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> >
> > The file is small and I can't read it from PyArrow:
> >
> > > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
>
> Here is your problem. Try `pyarrow.ipc.open_stream`.
>
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
> >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
> >     self._open(source, footer_offset=footer_offset)
> >   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
> >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowInvalid: File is too small: 8
> >
> > Here are the Arrow and g++ versions:
> >
> > > dpkg -s libarrow-dev
> > Package: libarrow-dev
> > Status: install ok installed
> > Priority: optional
> > Section: libdevel
> > Installed-Size: 38738
> > Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> > Architecture: amd64
> > Multi-Arch: same
> > Source: apache-arrow
> > Version: 0.17.1-1
> > Depends: libarrow17 (= 0.17.1-1)
> >
> > > g++ --version
> > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> >
> > Does this make sense?
> >
> > Cheers,
> > Rares
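For completeness, if the goal is for pyarrow.ipc.open_file to work, the writer side has to produce the IPC file format rather than the stream format. Here is a rough sketch of that variant, under the assumption that arrow::ipc::NewFileWriter is available in 0.17 as the file-format counterpart of the NewStreamWriter call used in the repro above (WriteAsFileFormat is just an illustrative name):

#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/ipc/writer.h>
#include <arrow/record_batch.h>

// Sketch only: write a single batch in the IPC file format so that
// pyarrow.ipc.open_file() can read it. Output produced by NewStreamWriter
// should instead be read with pyarrow.ipc.open_stream().
arrow::Status WriteAsFileFormat(const std::shared_ptr<arrow::RecordBatch>& batch,
                                const std::string& path) {
    std::shared_ptr<arrow::io::OutputStream> sink;
    ARROW_ASSIGN_OR_RAISE(sink, arrow::io::FileOutputStream::Open(path));

    std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
    ARROW_ASSIGN_OR_RAISE(writer,
                          arrow::ipc::NewFileWriter(sink.get(), batch->schema()));

    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
    // Close() writes the footer that open_file() looks for.
    ARROW_RETURN_NOT_OK(writer->Close());
    return sink->Close();
}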
> >
> >
> > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:
> >
> > > This is the compiler:
> > >
> > > > g++ --version
> > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > >
> > > And this is how I compile the code:
> > >
> > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> > >
> > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > >
> > > We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI
> > > still has PyArrow binaries for 2.7.
> > >
> > > Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> > > also fixed all the deprecation warnings, but that did not help either.
> > >
> > > Setting a breakpoint might be a challenge since this code runs as a
> > > plug-in, but I'll try to isolate this further.
> > >
> > > Thanks!
> > > Rares
> > >
> > >
> > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> What compiler are you using?
> > >>
> > >> In 0.16.0 (what you said you were targeting, though it would be better
> > >> for you to upgrade to 0.17.1) the schema is written in the CheckStarted
> > >> function here:
> > >>
> > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > >>
> > >> Status CheckStarted() {
> > >>   if (!started_) {
> > >>     return Start();
> > >>   }
> > >>   return Status::OK();
> > >> }
> > >>
> > >> started_ is set to false by a default member initializer in the
> > >> protected block. Maybe you should set a breakpoint in this function
> > >> and see if for some reason started_ is true on the first invocation
> > >> (in which case it makes me wonder if there is something
> > >> not-fully-C++11-compliant about your toolchain).
> > >>
> > >> Otherwise I'm a bit stumped since there are lots of production
> > >> applications that use this code.
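Since attaching a debugger to the plug-in may be awkward, one cheap sanity check is to reproduce the started_ pattern in a standalone file and confirm the toolchain honors default member initializers. This is purely an illustrative sketch with made-up names, mirroring the shape of the CheckStarted()/started_ code quoted above rather than the actual Arrow class:

#include <cassert>

// Illustrative stand-in for the writer's started_/CheckStarted() pattern.
// Not Arrow code; the names are invented for this check.
class FakeWriter {
public:
    bool CheckStarted() {
        if (!started_) {
            started_ = true;  // the real writer calls Start() and writes the schema here
            return true;
        }
        return false;
    }

protected:
    bool started_ = false;  // default member initializer, as in writer.cc
};

int main() {
    FakeWriter w;
    assert(w.CheckStarted());   // first call: schema would be written
    assert(!w.CheckStarted());  // later calls: no-op
    return 0;
}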
> > >>
> > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
> > >> >
> > >> > Sure, here is briefly what I'm doing:
> > >> >
> > >> >     bool append = false;
> > >> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > >> >     auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> > >> >     arrowStream = arrowResult.ValueOrDie();
> > >> >
> > >> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > >> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > >> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > >> >
> > >> >     std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> > >> >         inputSchema, settings.isAttsOnly());
> > >> >     ARROW_RETURN_NOT_OK(
> > >> >         arrow::ipc::RecordBatchStreamWriter::Open(
> > >> >             arrowStream.get(), arrowSchema, &arrowWriter));
> > >> >
> > >> >     // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > >> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > >> >     ARROW_RETURN_NOT_OK(
> > >> >         arrowWriter->WriteRecordBatch(*arrowBatch));
> > >> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > >> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
> > >> >
> > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >> >
> > >> > > Can you show the code you are writing? The first thing the stream writer
> > >> > > does before writing any record batch is write the schema. It sounds like
> > >> > > you are using arrow::ipc::WriteRecordBatch somewhere.
> > >> > >
> > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
> > >> > >
> > >> > > > Hello,
> > >> > > >
> > >> > > > I have a RecordBatch that I would like to write to a file. I'm using
> > >> > > > FileOutputStream::Open to open the file and RecordBatchStreamWriter::Open
> > >> > > > to open the stream. I write a record batch with WriteRecordBatch. Finally,
> > >> > > > I close the RecordBatchWriter and OutputStream.
> > >> > > >
> > >> > > > The resulting file size is exactly the size of the Buffer used to store
> > >> > > > the RecordBatch. It looks like it is missing the schema. When I try to
> > >> > > > open the resulting file from PyArrow I get:
> > >> > > >
> > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > >> > > >
> > >> > > > $ ll /tmp/1
> > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > >> > > >
> > >> > > > How can I write the schema as well?
> > >> > > >
> > >> > > > I was browsing the documentation at
> > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any C++
> > >> > > > documentation about RecordBatchStreamWriter or RecordBatchWriter. Is this
> > >> > > > intentional?
> > >> > > >
> > >> > > > Thank you!
> > >> > > > Rares
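One more way to narrow this down is to read the stream back from C++ itself and print the schema, which confirms whether the schema made it into the file independently of the Python side. Below is a rough sketch against the 0.17 Result-based reader API (assuming the file was written with the stream writer shown earlier; the exact Open overloads here are an assumption, not copied from the original code):

#include <cstdio>
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>

// Sketch: open a stream-format IPC file, print its schema, and read the
// first record batch.
arrow::Status ReadBack(const std::string& path) {
    std::shared_ptr<arrow::io::ReadableFile> file;
    ARROW_ASSIGN_OR_RAISE(file, arrow::io::ReadableFile::Open(path));

    std::shared_ptr<arrow::RecordBatchReader> reader;
    ARROW_ASSIGN_OR_RAISE(reader, arrow::ipc::RecordBatchStreamReader::Open(file));

    // If the schema was never written, Open() above should already fail.
    std::printf("%s\n", reader->schema()->ToString().c_str());

    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    std::printf("rows in first batch: %lld\n",
                static_cast<long long>(batch->num_rows()));
    return arrow::Status::OK();
}

int main() {
    return ReadBack("/tmp/foo").ok() ? 0 : 1;
}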