francisco-ixpantia opened a new issue, #14476: URL: https://github.com/apache/arrow/issues/14476
According to Bigtable's documentation https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#parquetfiletocloudbigtable the required schema (in Avro format) is as follows:

```
{
  "name" : "BigtableRow",
  "type" : "record",
  "namespace" : "com.google.cloud.teleport.bigtable",
  "fields" : [
    { "name" : "key", "type" : "bytes"},
    { "name" : "cells",
      "type" : {
        "type" : "array",
        "items": {
          "name": "BigtableCell",
          "type": "record",
          "fields": [
            { "name" : "family", "type" : "string"},
            { "name" : "qualifier", "type" : "bytes"},
            { "name" : "timestamp", "type" : "long", "logicalType" : "timestamp-micros"},
            { "name" : "value", "type" : "bytes"}
          ]
        }
      }
    }
  ]
}
```

Using R's Arrow library https://arrow.apache.org/docs/r/ I set up this schema as follows:

```R
library(arrow)

bigtablecell <- struct(
  family = string(),
  qualifier = binary(),
  timestamp = timestamp(unit = "ms"),
  value = binary()
)
bigtablerow <- schema(key = binary(), cells = list_of(bigtablecell))
bigtablecell_schema <- schema(bigtablecell = bigtablecell)
```

The problem is how to build, from R, a Parquet file that fits this schema. To represent a single row, the furthest I have come is the following:

```R
library(tibble)

# NOTE: `family` is not defined in this snippet.
bigtablecells_test <- Array$create(
  list(
    tibble(
      family = family,
      qualifier = Array$create("filter_id")$cast(binary())$as_vector(),
      timestamp = Array$create(1234567890L, type = int64())$cast(timestamp("ms"))$as_vector(),
      value = Array$create("0cd714fd-f6e8-4b76-aa16-1655b83e6148")$cast(binary())$as_vector()
    ),
    tibble(
      family = family,
      qualifier = Array$create("user_id")$cast(binary())$as_vector(),
      timestamp = Array$create(1234567891L, type = int64())$cast(timestamp("ms"))$as_vector(),
      value = Array$create("1655")$cast(binary())$as_vector()
    )
  )
)$as_vector()

key_test <- Array$create("1")$cast(binary())$as_vector()
data <- tibble(key = key_test, cells = bigtablecells_test) # returns two rows!! 😢
tab <- arrow_table(data, schema = bigtablerow)
```

The sequence `Array$create(list(tibble(), tibble()))` is meant to build the `list_of(bigtablecell)` part of the schema, which is a `StructArray`. In R I cannot create it directly with `StructArray$create()`; instead I have to rely on `Array$create()` to infer the type from the value I pass as a parameter. However, I have not been able to get the content I pass to `Array$create()` to be interpreted as a `StructArray`: when I convert it to a tibble it becomes two rows instead of a single row with two columns. I have tried `Scalar` and `ChunkedArray` in different combinations without success.

Consequently, I request an example of how to define a `StructArray` from R, along with any information you consider relevant for building Parquet files from R. In addition to the official documentation, I have followed in great detail the articles on Danielle Navarro's blog, such as https://blog.djnavarro.net/posts/2022-05-25_arrays-and-tables-in-arrow/, without finding examples for building a `StructArray`.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
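For reference, here is a minimal sketch of one way the single-row case might be expressed. It is not a confirmed answer from the Arrow maintainers. It relies on the assumption that the R package maps a data frame to a struct and a list column of data frames to a list of structs, so both cells go into one two-row tibble that is wrapped in a list of length one. The helper `as_binary()`, the column family name `"cf"`, and the timestamp values are placeholders introduced for illustration. The unit `"us"` is chosen here to match Avro's `timestamp-micros`, whereas the snippet in the issue uses `"ms"`.

```r
library(arrow)
library(tibble)

# Schema mirroring the Avro definition from the issue.
# Assumption: "us" (microseconds) is used to match timestamp-micros.
bigtablecell <- struct(
  family = string(),
  qualifier = binary(),
  timestamp = timestamp(unit = "us"),
  value = binary()
)
bigtablerow <- schema(key = binary(), cells = list_of(bigtablecell))

# Hypothetical helper: character vector -> R binary vector, using the
# same cast-and-convert pattern as the issue.
as_binary <- function(x) Array$create(x)$cast(binary())$as_vector()

# One tibble holding both cells (two rows, one per cell).
cells <- tibble(
  family = c("cf", "cf"), # placeholder column family name
  qualifier = as_binary(c("filter_id", "user_id")),
  timestamp = as.POSIXct(c(1234567890, 1234567891),
                         origin = "1970-01-01", tz = "UTC"),
  value = as_binary(c("0cd714fd-f6e8-4b76-aa16-1655b83e6148", "1655"))
)

# Wrap the tibble in a list of length one so that `cells` is a single
# list element, i.e. one Bigtable row with two cells.
data <- tibble(key = as_binary("1"), cells = list(cells))

tab <- arrow_table(data, schema = bigtablerow)
print(tab$num_rows)
print(tab$schema)

# Write the table to a temporary Parquet file.
write_parquet(tab, tempfile(fileext = ".parquet"))
```

The difference from the attempt in the issue is the shape of the list column: `list(tibble_a, tibble_b)` has two elements and therefore yields two rows, while `list(cells)` has one element whose tibble carries the two cells.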
