Re: Any standard way for min/max values per record-batch?
Hello,

Let me share our work on supporting min/max statistics per record batch.
https://github.com/heterodb/pg-strom/wiki/806:-Apache-Arrow-Min-Max-Statistics-Hint

The latest pg2arrow supports a --stat option that specifies the columns for which min/max statistics are collected per record batch. The statistics are embedded in the custom_metadata[] of the Field (in the Footer area).

The example below dumps a database table with statistics on the lo_orderdate column.

$ pg2arrow -d postgres -o /dev/shm/flineorder_sort.arrow -t lineorder_sort --stat=lo_orderdate --progress
RecordBatch[0]: offset=1640 length=268437080 (meta=920, body=268436160) nitems=1303085
RecordBatch[1]: offset=268438720 length=268437080 (meta=920, body=268436160) nitems=1303085
    :
RecordBatch[9]: offset=2415935360 length=5566 (meta=920, body=55667968) nitems=270231

You can then find the custom metadata (min_values and max_values) on the lo_orderdate field. These are comma-separated integer lists with 10 elements, as many as the number of record batches, so the first items of min_values and max_values are the min/max datum of record-batch[0].

$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/dev/shm/flineorder_sort.arrow')
>>> X.schema
lo_orderkey: decimal(30, 8)
lo_linenumber: int32
lo_custkey: decimal(30, 8)
lo_partkey: int32
lo_suppkey: decimal(30, 8)
lo_orderdate: int32
  -- field metadata --
  min_values: '19920101,19920919,19930608,19940223,1994,19950730,1996' + 31
  max_values: '19920919,19930608,19940223,1994,19950730,19960417,1997' + 31
lo_orderpriority: fixed_size_binary[15]
lo_shippriority: fixed_size_binary[1]
lo_quantity: decimal(30, 8)
lo_extendedprice: decimal(30, 8)
lo_ordertotalprice: decimal(30, 8)
    :

When we scan an arrow_fdw foreign table that maps an Apache Arrow file with the above min/max statistics, it automatically checks the WHERE-clause and skips record batches that contain no rows that can survive the filter.

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM flineorder_sort WHERE lo_orderpriority = '2-HIGH' AND lo_orderdate BETWEEN 19940101 AND 19940630;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=33143.09..33143.10 rows=1 width=8) (actual time=115.591..115.593 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg) on flineorder_sort  (cost=33139.52..33142.07 rows=204 width=8) (actual time=115.580..115.585 rows=1 loops=1)
         Reduction: NoGroup
         Outer Scan: flineorder_sort  (cost=4000.00..33139.42 rows=300 width=0) (actual time=10.682..21.555 rows=2606170 loops=1)
         Outer Scan Filter: ((lo_orderdate >= 19940101) AND (lo_orderdate <= 19940630) AND (lo_orderpriority = '2-HIGH'::bpchar))
         Rows Removed by Outer Scan Filter: 2425885
         referenced: lo_orderdate, lo_orderpriority
         Stats-Hint: (lo_orderdate >= 19940101), (lo_orderdate <= 19940630) [loaded: 2, skipped: 8]
         files0: /dev/shm/flineorder_sort.arrow (read: 217.52MB, size: 2357.11MB)
 Planning Time: 0.210 ms
 Execution Time: 153.508 ms
(11 rows)

This EXPLAIN ANALYZE output includes the Stats-Hint line.
It says Arrow_Fdw could use (lo_orderdate >= 19940101) and (lo_orderdate <= 19940630) to check the min/max statistics; it actually skipped 8 record batches and loaded only 2.

Our main target is IoT/M2M-grade time-series log processing, because such logs always contain a timestamp for each entry, and physically close rows tend to have similar values.

These min/max statistics are not limited to Apache Arrow files generated by pg2arrow: they can be appended to an existing file by rewriting only the Footer portion, without relocating record batches. So, we plan to provide a standalone command to attach min/max statistics onto existing Apache Arrow files generated by other tools.

Best regards,

On Thu, Feb 18, 2021 at 13:33 Kohei KaiGai wrote:
>
> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message). I think
> > standardizing how we store statistics per batch does make sense.
>
> For example, a JSON array of min/max values as a key-value metadata
> in the Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be less than INT_MAX, I think it
> is a sufficiently portable and non-invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility. But another option would be to add a new structure as a
> > parallel list on
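The skipping logic described above can be sketched in a few lines of plain Python. The per-batch min/max values below are hypothetical stand-ins (the real ones are parsed from the min_values/max_values strings in the field's custom_metadata, which are truncated in the listing above); the predicate matches the EXPLAIN example, lo_orderdate BETWEEN 19940101 AND 19940630.

```python
# min_values[i] / max_values[i] are the statistics of record-batch[i];
# the ten values below are hypothetical stand-ins for illustration.
mins = [19920101, 19920919, 19930608, 19940223, 19941001,
        19950730, 19960417, 19970105, 19971120, 19980601]
maxs = [19920919, 19930608, 19940223, 19941001, 19950730,
        19960417, 19970105, 19971120, 19980601, 19980801]

def batches_to_load(min_values, max_values, lo, hi):
    """Indices of record batches whose [min, max] range may overlap
    the predicate lo <= value <= hi; all other batches can be skipped."""
    return [i for i, (mn, mx) in enumerate(zip(min_values, max_values))
            if mx >= lo and mn <= hi]

# Same predicate as the EXPLAIN above: BETWEEN 19940101 AND 19940630.
loaded = batches_to_load(mins, maxs, 19940101, 19940630)
print(loaded)  # → [2, 3]  (loaded: 2, skipped: 8)
```

With these stand-in statistics, only two of the ten batches survive the range check, which mirrors the "[loaded: 2, skipped: 8]" Stats-Hint above.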
Re: Any standard way for min/max values per record-batch?
Thanks for the clarification.

> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message). I think
> standardizing how we store statistics per batch does make sense.
>
For example, a JSON array of min/max values as a key-value metadata in the Footer->Schema->Fields[]->custom_metadata? Even though the metadata field must be less than INT_MAX, I think it is a sufficiently portable and non-invasive way.

> We unfortunately can't add anything to field-node without breaking
> compatibility. But another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
What does the parallel list mean? If we had a standardized binary structure, like DictionaryBatch, to store statistics including min/max values, that would of course make even more sense than text-encoded key-value metadata.

Best regards,

On Thu, Feb 18, 2021 at 12:37 Micah Kornfield wrote:
>
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message). I think
> standardizing how we store statistics per batch does make sense.
>
> We unfortunately can't add anything to field-node without breaking
> compatibility. But another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
> On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai wrote:
>
> > Hello,
> >
> > Does Apache Arrow have any standard way to embed min/max values of the
> > fields on a per-record-batch basis?
> > It looks like FieldNode supports neither a dedicated min/max attribute
> > nor custom-metadata.
> > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> >
> > If we embed an array of min/max values into the custom-metadata of the
> > Field-node, we may be able to implement it.
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> >
> > What I'd like to implement is something like the BRIN index in PostgreSQL.
> > http://heterodb.github.io/pg-strom/brin/
> >
> > This index contains only min/max values for particular block ranges, and
> > the query executor can skip blocks that obviously don't contain the
> > target data. If we can skip 9990 of 10000 record batches by checking
> > metadata on a query that tries to fetch items in a very narrow timestamp
> > range, it is a great acceleration compared to full file scans.
> >
> > Best regards,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
Any standard way for min/max values per record-batch?
Hello,

Does Apache Arrow have any standard way to embed min/max values of the fields on a per-record-batch basis? It looks like FieldNode supports neither a dedicated min/max attribute nor custom-metadata.
https://github.com/apache/arrow/blob/master/format/Message.fbs#L28

If we embed an array of min/max values into the custom-metadata of the Field-node, we may be able to implement it.
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344

What I'd like to implement is something like the BRIN index in PostgreSQL.
http://heterodb.github.io/pg-strom/brin/

This index contains only min/max values for particular block ranges, and the query executor can skip blocks that obviously don't contain the target data. If we can skip 9990 of 10000 record batches by checking metadata on a query that tries to fetch items in a very narrow timestamp range, it is a great acceleration compared to full file scans.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
Pcap2Arrow - Packet capture and data conversion tool to Apache Arrow on the fly
Hello,

Let me share my recent work below:
https://github.com/heterodb/pg-strom/wiki/804:-Pcap2Arrow

This standalone command-line tool captures network packets from network interface devices, converts them into the Apache Arrow data format according to a pre-defined data schema for each supported protocol (TCP, UDP, ICMP x IPv4, IPv6), then writes out the destination files. It internally uses PF_RING [*1] to support fast network interface cards (>10Gb) and to minimize packet loss by utilizing multi-core CPUs.

Although I confirmed that Pcap2Arrow writes out captured network packets at a rate of more than 50Gb/s, my test cases use artificial and biased traffic patterns. If you can test the software in your environment, that would help improve it.

[*1] https://www.ntop.org/products/packet-capture/pf_ring/

As you may know, network traffic data tends to grow very large, so it is not easy to import it into database systems for analytics. Once we convert it into Apache Arrow, we don't need to import the captured data again; we just map the files prior to analytics.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
Re: Human-readable version of Arrow Schema?
Hello,

pg2arrow [*1] has a '--dump' mode to print out the schema definition of a given Apache Arrow file. Does that work for you?

$ ./pg2arrow --dump ~/hoge.arrow
[Footer]
{Footer: version=V4, schema={Schema: endianness=little, fields=[
  {Field: name="id", nullable=true, type={Int32}, children=[], custom_metadata=[]},
  {Field: name="a", nullable=true, type={Float64}, children=[], custom_metadata=[]},
  {Field: name="b", nullable=true, type={Decimal: precision=11, scale=7}, children=[], custom_metadata=[]},
  {Field: name="c", nullable=true, type={Struct}, children=[
    {Field: name="x", nullable=true, type={Int32}, children=[], custom_metadata=[]},
    {Field: name="y", nullable=true, type={Float32}, children=[], custom_metadata=[]},
    {Field: name="z", nullable=true, type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
  {Field: name="d", nullable=true, type={Utf8}, dictionary={DictionaryEncoding: id=0, indexType={Int32}, isOrdered=false}, children=[], custom_metadata=[]},
  {Field: name="e", nullable=true, type={Timestamp: unit=us}, children=[], custom_metadata=[]},
  {Field: name="f", nullable=true, type={Utf8}, children=[], custom_metadata=[]},
  {Field: name="random", nullable=true, type={Float64}, children=[], custom_metadata=[]}],
 custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random() FROM t"}]},
 dictionaries=[{Block: offset=920, metaDataLength=184 bodyLength=128}],
 recordBatches=[{Block: offset=1232, metaDataLength=648 bodyLength=386112}]}

[Dictionary Batch 0]
{Block: offset=920, metaDataLength=184 bodyLength=128}
{Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch: length=6, nodes=[{FieldNode: length=6, null_count=0}], buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64}, {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}

[Record Batch 0]
{Block: offset=1232, metaDataLength=648 bodyLength=386112}
{Message: version=V4, body={RecordBatch: length=3000, nodes=[{FieldNode: length=3000, null_count=0},
  {FieldNode: length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
  {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000, null_count=56},
  {FieldNode: length=3000, null_count=66}, {FieldNode: length=3000, null_count=0},
  {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000, null_count=64},
  {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000, null_count=0}],
  buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=12032},
  {Buffer: offset=12032, length=384}, {Buffer: offset=12416, length=24000},
  {Buffer: offset=36416, length=384}, {Buffer: offset=36800, length=48000},
  {Buffer: offset=84800, length=0}, {Buffer: offset=84800, length=384},
  {Buffer: offset=85184, length=12032}, {Buffer: offset=97216, length=384},
  {Buffer: offset=97600, length=12032}, {Buffer: offset=109632, length=0},
  {Buffer: offset=109632, length=12032}, {Buffer: offset=121664, length=96000},
  {Buffer: offset=217664, length=0}, {Buffer: offset=217664, length=12032},
  {Buffer: offset=229696, length=384}, {Buffer: offset=230080, length=24000},
  {Buffer: offset=254080, length=0}, {Buffer: offset=254080, length=12032},
  {Buffer: offset=266112, length=96000}, {Buffer: offset=362112, length=0},
  {Buffer: offset=362112, length=24000}]}, bodyLength=386112}

[*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow

On Sat, Dec 7, 2019 at 6:26 Christian Hudon wrote:
>
> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now?
> If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
> Christian
> --
> Christian Hudon
> Applied Research Scientist
> Element AI, 6650 Saint-Urbain #500
> Montréal, QC, H2S 3G9, Canada
> Elementai.com

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
Re: How about inet4/inet6/macaddr data types?
Hello Wes,

@ktou also introduced your work to me. As long as the custom_metadata format to declare the custom datatype is well defined in the specification or a document somewhere, independent of the library implementation, it looks sufficient to me.

Your UUID example uses the FixedSizeBinary raw data type to wrap UUID values and puts "arrow_extension_name=uuid" and "arrow_extension_data=uuid-type-unique-code" on the custom_metadata of Field "f0", right? If that is documented somewhere, people can reproduce the custom datatype in their applications, and other folks can also read it.

Thanks,

On Tue, Apr 30, 2019 at 23:47 Wes McKinney wrote:
>
> hi Kohei,
>
> Since the introduction of arrow::ExtensionType in ARROW-585 [1] we
> have a well-defined method of creating new data types without having
> to manually interact with the custom_metadata Schema information. Can
> you have a look at that and see if it meets your requirements? This
> can be a useful way of extending the Arrow format for your use cases
> while the community may discuss formally adding new logical types to
> the format (or not).
>
> In the unit tests you can see a UUID type I have defined and
> serialized through Arrow's binary protocol machinery
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>
> Thanks
> Wes
>
> [1]: https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
>
> On Tue, Apr 30, 2019 at 1:34 AM Kohei KaiGai wrote:
> >
> > Hello,
> >
> > This is a proposal to add new logical types to the Apache Arrow data
> > format.
> >
> > As Melik-Adamyan said, it is quite easy to convert a 5-byte
> > FixedSizeBinary to PostgreSQL's inet data type by the Arrow_Fdw module
> > (an extension of PostgreSQL responsible for data conversion); however,
> > it is not obvious to readers whether it is a network address or just a
> > small binary blob.
> >
> > https://www.postgresql.org/docs/11/sql-importforeignschema.html
> > PostgreSQL has the IMPORT FOREIGN SCHEMA command, which allows defining
> > a foreign table according to the schema information of the external
> > data source. In the case of Arrow_Fdw, we can define a foreign table
> > without manually listing columns and data types, as follows:
> >
> >   IMPORT FOREIGN SCHEMA foo FROM arrow_fdw INTO public
> >     OPTIONS (file '/opt/nvme/foo.arrow');
> >
> > In this case, the Schema definition in 'foo.arrow' can tell PostgreSQL
> > how many columns are defined, their names, data types, and so on.
> > However, PostgreSQL may be confused when converting a FixedSizeBinary
> > (width=5) without any metadata support: it may be an 'inet4' value, or
> > it may be a 'char(5)'.
> >
> > One idea is to use the custom_metadata field of the Field node; we may
> > be able to mark that it is a network address, not a blob. However, I
> > could not find a specification for custom_metadata.
> >
> > I expect network addresses are widely used in the log-processing area,
> > and no small number of applications would support them. If so, it is
> > not too niche a requirement to define a new logical data type in the
> > Apache Arrow data format.
> >
> > Best regards,
> >
> > On Tue, Apr 30, 2019 at 15:13 Micah Kornfield wrote:
> > >
> > > Hi KaiGai Kohei,
> > > Can you clarify if you are looking for advice on modelling these types or
> > > proposing to add new logical types to the Arrow specification?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Monday, April 29, 2019, Kohei KaiGai wrote:
> > >
> > > > Hello folks,
> > > >
> > > > What are your opinions about network address type support in the
> > > > Apache Arrow data format?
> > > > Network addresses always appear in logs massively generated by
> > > > network facilities, and they are significant information when
> > > > people analyze their past logs.
> > > >
> > > > I'm working on Apache Arrow format mapping for PostgreSQL.
> > > > http://heterodb.github.io/pg-strom/arrow_fdw/
> > > >
> > > > This extension allows reading Arrow files as if they were PostgreSQL
> > > > tables, using foreign tables. Arrow data types are mapped to the
> > > > relevant PostgreSQL data types according to the above documentation.
> > > >
> > > > https://www.postgresql.org/docs/current/datatype-net-types.
Re: How about inet4/inet6/macaddr data types?
Hello,

This is a proposal to add new logical types to the Apache Arrow data format.

As Melik-Adamyan said, it is quite easy to convert a 5-byte FixedSizeBinary to PostgreSQL's inet data type by the Arrow_Fdw module (an extension of PostgreSQL responsible for data conversion); however, it is not obvious to readers whether it is a network address or just a small binary blob.

https://www.postgresql.org/docs/11/sql-importforeignschema.html
PostgreSQL has the IMPORT FOREIGN SCHEMA command, which allows defining a foreign table according to the schema information of the external data source. In the case of Arrow_Fdw, we can define a foreign table without manually listing columns and data types, as follows:

  IMPORT FOREIGN SCHEMA foo FROM arrow_fdw INTO public
    OPTIONS (file '/opt/nvme/foo.arrow');

In this case, the Schema definition in 'foo.arrow' can tell PostgreSQL how many columns are defined, their names, data types, and so on. However, PostgreSQL may be confused when converting a FixedSizeBinary (width=5) without any metadata support: it may be an 'inet4' value, or it may be a 'char(5)'.

One idea is to use the custom_metadata field of the Field node; we may be able to mark that it is a network address, not a blob. However, I could not find a specification for custom_metadata.

I expect network addresses are widely used in the log-processing area, and no small number of applications would support them. If so, it is not too niche a requirement to define a new logical data type in the Apache Arrow data format.

Best regards,

On Tue, Apr 30, 2019 at 15:13 Micah Kornfield wrote:
>
> Hi KaiGai Kohei,
> Can you clarify if you are looking for advice on modelling these types or
> proposing to add new logical types to the Arrow specification?
>
> Thanks,
> Micah
>
> On Monday, April 29, 2019, Kohei KaiGai wrote:
>
> > Hello folks,
> >
> > What are your opinions about network address type support in the Apache
> > Arrow data format?
> > Network addresses always appear in logs massively generated by network
> > facilities, and they are significant information when people analyze
> > their past logs.
> >
> > I'm working on Apache Arrow format mapping for PostgreSQL.
> > http://heterodb.github.io/pg-strom/arrow_fdw/
> >
> > This extension allows reading Arrow files as if they were PostgreSQL
> > tables, using foreign tables. Arrow data types are mapped to the
> > relevant PostgreSQL data types according to the above documentation.
> >
> > https://www.postgresql.org/docs/current/datatype-net-types.html
> > PostgreSQL supports some network address types and operators.
> > For example, we can put a qualifier like: WHERE addr <<= inet
> > '192.168.1.0/24', to find all the records in the subnet '192.168.1.0/24'.
> >
> > Probably these three data types are sufficient for most network logs:
> > inet4, inet6 and macaddr.
> > * inet4 is a 32bit + optional 8bit (for netmask) fixed-length value
> > * inet6 is a 128bit + optional 8bit (for netmask) fixed-length value
> > * macaddr is a 48bit fixed-length value
> >
> > I don't favor mapping the inetX types onto the variable-length Binary
> > data type, because it takes a 32bit offset to indicate a 32 or 40bit
> > value, which is quite inefficient, even though PostgreSQL allows mixing
> > the inet4/inet6 data types in the same column.
> >
> > Thanks,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
How about inet4/inet6/macaddr data types?
Hello folks,

What are your opinions about network address type support in the Apache Arrow data format? Network addresses always appear in logs massively generated by network facilities, and they are significant information when people analyze their past logs.

I'm working on Apache Arrow format mapping for PostgreSQL.
http://heterodb.github.io/pg-strom/arrow_fdw/

This extension allows reading Arrow files as if they were PostgreSQL tables, using foreign tables. Arrow data types are mapped to the relevant PostgreSQL data types according to the above documentation.

https://www.postgresql.org/docs/current/datatype-net-types.html
PostgreSQL supports some network address types and operators. For example, we can put a qualifier like: WHERE addr <<= inet '192.168.1.0/24', to find all the records in the subnet '192.168.1.0/24'.

Probably these three data types are sufficient for most network logs: inet4, inet6 and macaddr.
* inet4 is a 32bit + optional 8bit (for netmask) fixed-length value
* inet6 is a 128bit + optional 8bit (for netmask) fixed-length value
* macaddr is a 48bit fixed-length value

I don't favor mapping the inetX types onto the variable-length Binary data type, because it takes a 32bit offset to indicate a 32 or 40bit value, which is quite inefficient, even though PostgreSQL allows mixing the inet4/inet6 data types in the same column.

Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
Re: Format specification document?
Hello McKinney,

After posting my first message, I found a useful document here:
https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example

Unlike my expectation, the FlatBuffers mechanism has a quite different on-memory structure. So, let's review the Apache Arrow file binary according to that documentation...

00 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
10 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
20 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00

The first 8 bytes are the signature "ARROW1\0\0", and the following 4 bytes are the length of the metadata, independent of FlatBuffers. Then we fetch the int 0x0010 at 0x000c; it indicates that 0x000c + 0x0010 is the root table. The int value at 0x001c is 0x000a, which means the vtable structure begins at 0x001c - 0x000a = 0x0012.

0x0012  0a 00        --> vtable length = 10 bytes (5 items)
0x0014  0e 00        --> table length = 14 bytes, including the negative offset (4 bytes)
0x0016  06 00        --> table 0x001c + 0x0006 is the metadata version (short)
0x0018  05 00        --> table 0x001c + 0x0005 is the message header type (byte)
0x001a  08 00        --> table 0x001c + 0x0008 is the header offset (int)
0x001c  0a 00 00 00  --> negative offset to the vtable

So we can tell this file uses the Apache Arrow V4 format, and the header begins at 0x0024 + 0x0010.

20 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
30 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00

Next, 0x0034 is the position of the current table. It indicates that the vtable is at 0x0034 - 0x000a.

0x002a  0a 00  --> vtable length = 10 bytes (5 items)
0x002c  0c 00  --> table length = 12 bytes, including the negative offset (4 bytes)
0x002e  00 00  --> Schema::endianness is default (0 = little endian)
0x0030  04 00  --> Schema::fields[]
0x0032  08 00  --> Schema::custom_metadata[]

It says Schema::fields[] begins at 0x0038 + 0x03ec = 0x0424, and Schema::custom_metadata[] begins at 0x003c + 0x0004 = 0x0040.
From 0x0040:

40 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
50 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
60 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
70 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
80 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
90 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c

The binary from 0x0060 is a cstring ("pandas\0"), and the binary from 0x006c is also a cstring of JSON. The location indicated by 0x0040 holds the number of vector elements, so this metadata contains one key-value pair. The next int word indicates the sub-table at 0x0050. Its vtable is below:

0x0048  08 00  --> vtable length = 8 bytes (4 items)
0x004a  0c 00  --> table length = 12 bytes, including the negative offset (4 bytes)
0x004c  04 00  --> cstring offset (key) is at 0x0050 + 0x0004
0x004e  08 00  --> cstring offset (value) is at 0x0050 + 0x0008

The key is at 0x0054 + 0x0008. Here is an int value, 0x0006: the cstring length is 6 bytes, and the next byte (0x0060) begins the cstring body ("pandas\0"). The value is at 0x0058 + 0x0010. Here is an int value, 0x03b4 (= 948 bytes), and the next byte (0x006c) begins the cstring body ("{"pandas_version" ...).

I didn't follow the entire data file, but this makes things much clearer to me.

Best regards,

On Sun, Jan 6, 2019 at 8:50 Wes McKinney wrote:
>
> hi Kohei,
>
> On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai wrote:
> >
> > Hello,
> >
> > I'm now trying to understand the Apache Arrow format for my application.
> > Is there a format specification document including the metadata layout?
> >
> > I checked out the description at:
> > https://github.com/apache/arrow/tree/master/docs/source/format
> > https://github.com/apache/arrow/tree/master/format
> >
> > The format/IPC.rst says an arrow file has the format below:
> >
> >   <magic number "ARROW1">
> >   <empty padding bytes [to 8 byte boundary]>
> >   <STREAMING FORMAT>
> >   <FOOTER>
> >   <FOOTER SIZE: int32>
> >   <magic number "ARROW1">
> >
> > Then, the STREAMING FORMAT begins with a SCHEMA message.
> > The message chunk has the format below:
> >
> >   <metadata_size: int32>
> >   <metadata_flatbuffer: bytes>
> >   <padding>
> >   <message body>
> >
> > I made an arrow file using pyarrow [*1]. It has the following binary.
> >
> > [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> > 00 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> > 10 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> > 20 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> > 30 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> > 40 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> > 50 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> > 60 70 61 6e 64 61 73 00 00
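The first steps of the walkthrough above can be re-checked with the standard library alone, using the first two rows of the hex dump:

```python
import struct

# First two 16-byte rows of the hex dump shown above.
buf = bytes.fromhex(
    '41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00'
    ' 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00'
)
magic = buf[0:8]                                 # "ARROW1" + 2 padding bytes
(meta_len,) = struct.unpack_from('<i', buf, 8)   # metadata length: 0x058c = 1420
(root,) = struct.unpack_from('<i', buf, 12)      # root table offset, relative to 0x0c
table = 12 + root                                # 0x1c: the Message root table
(soffset,) = struct.unpack_from('<i', buf, table)
vtable = table - soffset                         # 0x1c - 0x0a = 0x12: the vtable
vt_len, tbl_len = struct.unpack_from('<HH', buf, vtable)
print(magic, meta_len, hex(vtable), vt_len, tbl_len)
```

Running this reproduces the values derived by hand in the reply: a 10-byte vtable and a 14-byte table at 0x0012, for a 1420-byte metadata block.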
Format specification document?
Hello,

I'm now trying to understand the Apache Arrow format for my application. Is there a format specification document including the metadata layout?

I checked out the description at:
https://github.com/apache/arrow/tree/master/docs/source/format
https://github.com/apache/arrow/tree/master/format

The format/IPC.rst says an arrow file has the format below:

  <magic number "ARROW1">
  <empty padding bytes [to 8 byte boundary]>
  <STREAMING FORMAT>
  <FOOTER>
  <FOOTER SIZE: int32>
  <magic number "ARROW1">

Then, the STREAMING FORMAT begins with a SCHEMA message. The message chunk has the format below:

  <metadata_size: int32>
  <metadata_flatbuffer: bytes>
  <padding>
  <message body>

I made an arrow file using pyarrow [*1]. It has the following binary.

[kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
00 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
10 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
20 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
30 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
40 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
50 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
60 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
70 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
80 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
90 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
e0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
f0 22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61

The first 8 bytes are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420), which is reasonable as the SCHEMA message length. The next 32 bits are 0x0010 (=16); it may be the metadata_size of the FlatBuffer. The IPC.rst does not mention the FlatBuffers metadata layout, so I tried to skip the next 16 bytes, expecting the message body to begin at 0x20. However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03 (= RecordBatch, not Schema!), and the following 64 bits are 0x0a0010(!). Obviously I'm misunderstanding something.

Is there any documentation that introduces the detailed layout of the arrow format?
Thanks,

[*1] Steps to make a sample arrow file:

$ python3.5
>>> import pyarrow as pa
>>> import pandas as pd
>>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
...                 con="postgresql://localhost/postgres")
>>> Y = pa.Table.from_pandas(X)
>>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
>>> f.write_table(Y)
>>> f.close()
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei