I was able to reproduce my issue in a small, self-contained program. Here is the source code:

#include <arrow/builder.h>
#include <arrow/io/file.h>
#include <arrow/ipc/writer.h>
#include <arrow/record_batch.h>

arrow::Status foo() {
  std::shared_ptr<arrow::io::OutputStream> arrowStream;
  std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
  std::shared_ptr<arrow::RecordBatch> arrowBatch;
  std::shared_ptr<arrow::RecordBatchReader> arrowReader;

  std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
  arrowFields[0] = arrow::field("foo", arrow::int64());
  arrowFields[1] = arrow::field("bar", arrow::int64());
  std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);

  std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
  arrow::Int64Builder arrowBuilder;

  for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 21; j++)
      if (i && (j % 2))
        arrowBuilder.AppendNull();
      else
        arrowBuilder.Append(j);
    ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
  }

  arrowBatch = arrow::RecordBatch::Make(arrowSchema, arrowArrays[0]->length(), arrowArrays);

  ARROW_ASSIGN_OR_RAISE(arrowStream, arrow::io::FileOutputStream::Open("/tmp/foo"));
  ARROW_ASSIGN_OR_RAISE(arrowWriter, arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
  ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
  ARROW_RETURN_NOT_OK(arrowWriter->Close());
  ARROW_RETURN_NOT_OK(arrowStream->Close());
  return arrow::Status::OK();
}

int main() { foo(); }

I compile and run it like this:

> g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
-rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo

The file is small and I can't read it from PyArrow:

> python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
    return RecordBatchFileReader(source, footer_offset=footer_offset)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
    self._open(source, footer_offset=footer_offset)
  File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: File is too small: 8

Here is the Arrow and G++ version:

> dpkg -s libarrow-dev
Package: libarrow-dev
Status: install ok installed
Priority: optional
Section: libdevel
Installed-Size: 38738
Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
Architecture: amd64
Multi-Arch: same
Source: apache-arrow
Version: 0.17.1-1
Depends: libarrow17 (= 0.17.1-1)

> g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

Does this make sense?

Cheers,
Rares

On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:

> This is the compiler:
>
> > g++ --version
> g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
>
> And this is how I compile the code:
>
> g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> -DUSE_ARROW -I.
> -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> -c PhysicalAioSave.cpp -o PhysicalAioSave.o
>
> g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> -Wl,-soname,libaccelerated_io_tools.so -L.
> -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
>
> We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI still
> has PyArrow binaries for 2.7.
>
> Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> also fixed all the deprecation warnings but that did not help either.
>
> Setting a breakpoint might be a challenge since this code runs as a
> plug-in, but I'll try to isolate this further.
>
> Thanks!
> Rares
>
> On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> What compiler are you using?
>>
>> In 0.16.0 (what you said you were targeting, though it would be better
>> for you to upgrade to 0.17.1) schema is written in the CheckStarted
>> function here
>>
>> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
>>
>>   Status CheckStarted() {
>>     if (!started_) {
>>       return Start();
>>     }
>>     return Status::OK();
>>   }
>>
>> started_ is set to false by a default member initializer in the
>> protected block.
>> Maybe you should set a breakpoint in this function
>> and see if for some reason started_ is true on the first invocation
>> (in which case it makes me wonder if there is something
>> not-fully-C++11-compliant about your toolchain).
>>
>> Otherwise I'm a bit stumped since there are lots of production
>> applications that use this code.
>>
>> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
>> >
>> > Sure, here is briefly what I'm doing:
>> >
>> > bool append = false;
>> > std::shared_ptr<arrow::io::OutputStream> arrowStream;
>> > auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
>> > arrowStream = arrowResult.ValueOrDie();
>> >
>> > std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
>> > std::shared_ptr<arrow::RecordBatch> arrowBatch;
>> > std::shared_ptr<arrow::RecordBatchReader> arrowReader;
>> >
>> > std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
>> >     inputSchema, settings.isAttsOnly());
>> > ARROW_RETURN_NOT_OK(
>> >     arrow::ipc::RecordBatchStreamWriter::Open(
>> >         arrowStream.get(), arrowSchema, &arrowWriter));
>> >
>> > // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
>> > ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
>> > ARROW_RETURN_NOT_OK(
>> >     arrowWriter->WriteRecordBatch(*arrowBatch));
>> > ARROW_RETURN_NOT_OK(arrowWriter->Close());
>> > ARROW_RETURN_NOT_OK(arrowStream->Close());
>> >
>> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > > Can you show the code you are writing? The first thing the stream writer
>> > > does before writing any record batch is write the schema. It sounds like
>> > > you are using arrow::ipc::WriteRecordBatch somewhere.
>> > >
>> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I have a RecordBatch that I would like to write to a file.
>> > > > I'm using FileOutputStream::Open to open the file and
>> > > > RecordBatchStreamWriter::Open to open the stream. I write a record
>> > > > batch with WriteRecordBatch. Finally, I close the RecordBatchWriter
>> > > > and OutputStream.
>> > > >
>> > > > The resulting file size is exactly the size of the Buffer used to store
>> > > > the RecordBatch. It looks like it is missing the schema. When I try to
>> > > > open the resulting file from PyArrow I get:
>> > > >
>> > > > >>> pa.ipc.open_file('/tmp/1')
>> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
>> > > >
>> > > > $ ll /tmp/1
>> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
>> > > >
>> > > > How can I write the schema as well?
>> > > >
>> > > > I was browsing the documentation at
>> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any C++
>> > > > documentation about RecordBatchStreamWriter or RecordBatchWriter. Is
>> > > > this intentional?
>> > > >
>> > > > Thank you!
>> > > > Rares
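
A side note on the "File is too small" error quoted in this thread: NewStreamWriter and RecordBatchStreamWriter produce the Arrow IPC streaming format, whereas pyarrow.ipc.open_file expects the random-access file format, which begins and ends with the magic bytes ARROW1. The following is a minimal sketch, using only the Python standard library, of one way to check which of the two a file on disk looks like. The helper names looks_like_arrow_file and write_sample, and the sample byte strings, are made up for illustration and are not part of Arrow.

```python
import os
import tempfile

# Magic bytes that open and close an Arrow IPC *file* (random-access) format.
ARROW_FILE_MAGIC = b"ARROW1"


def looks_like_arrow_file(path):
    """Return True if `path` both starts and ends with the file-format magic."""
    n = len(ARROW_FILE_MAGIC)
    if os.path.getsize(path) < 2 * n:
        return False
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(-n, os.SEEK_END)
        tail = f.read(n)
    return head == ARROW_FILE_MAGIC and tail == ARROW_FILE_MAGIC


def write_sample(payload):
    """Write `payload` to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path


# Hypothetical stand-ins, not real Arrow data: only the framing bytes matter.
# A stream (since Arrow 0.15) starts with the 0xFFFFFFFF continuation marker.
stream_like = write_sample(b"\xff\xff\xff\xff" + b"\x00" * 32)
file_like = write_sample(b"ARROW1\x00\x00" + b"\x00" * 32 + b"ARROW1")

print(looks_like_arrow_file(stream_like))  # False
print(looks_like_arrow_file(file_like))    # True
```

If the same check returns False for /tmp/foo, the file is probably in the streaming format, and pyarrow.ipc.open_stream('/tmp/foo') would be the reader to try instead of open_file.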