francisco-ixpantia opened a new issue, #14476:
URL: https://github.com/apache/arrow/issues/14476

   
   According to the Bigtable template documentation at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#parquetfiletocloudbigtable the required schema (in Avro format) is as follows:
   
   ```
   {
       "name" : "BigtableRow",
       "type" : "record",
       "namespace" : "com.google.cloud.teleport.bigtable",
       "fields" : [
         { "name" : "key", "type" : "bytes"},
         { "name" : "cells",
           "type" : {
             "type" : "array",
             "items": {
               "name": "BigtableCell",
               "type": "record",
               "fields": [
                 { "name" : "family", "type" : "string"},
                 { "name" : "qualifier", "type" : "bytes"},
                 { "name" : "timestamp", "type" : "long", "logicalType" : "timestamp-micros"},
                 { "name" : "value", "type" : "bytes"}
               ]
             }
           }
         }
      ]
   }
   ```
   
   Using R's arrow package (https://arrow.apache.org/docs/r/), I set up this schema as follows:
   
   ```R
   library(arrow)
   
   bigtablecell <- struct(
     family = string(),
     qualifier = binary(),
     timestamp = timestamp(unit = "us"),  # Avro "timestamp-micros" means microseconds
     value = binary()
   )
   bigtablerow <- schema(key = binary(), cells = list_of(bigtablecell))
   
   bigtablecell_schema <- schema(bigtablecell = bigtablecell)
   ```
   
   The problem is how to build, from R, a Parquet file that fits this schema. To represent a single row, the furthest I have come is the following:
   
   ```R
   library(tibble)

   # NOTE: `family` (the column family name) is assumed to be defined earlier.
   bigtablecells_test <- Array$create(
     list(
       tibble(
         family = family,
         qualifier = Array$create("filter_id")$cast(binary())$as_vector(),
         timestamp = Array$create(1234567890L, type = int64())$cast(timestamp("ms"))$as_vector(),
         value = Array$create("0cd714fd-f6e8-4b76-aa16-1655b83e6148")$cast(binary())$as_vector()
       ),
       tibble(
         family = family,
         qualifier = Array$create("user_id")$cast(binary())$as_vector(),
         timestamp = Array$create(1234567891L, type = int64())$cast(timestamp("ms"))$as_vector(),
         value = Array$create("1655")$cast(binary())$as_vector()
       )
     )
   )$as_vector()

   key_test <- Array$create("1")$cast(binary())$as_vector()
   # returns two rows!! 😢
   data <- tibble(key = key_test, cells = bigtablecells_test)
   tab <- arrow_table(data, schema = bigtablerow)
   ```
   
   I use the sequence `Array$create(list(tibble(), tibble()))` to build the `list_of(bigtablecell)` part of the schema, which is a `StructArray`. In R I cannot create it directly with `StructArray$create()`; instead I have to rely on `Array$create()` to detect the type automatically from the value I pass as a parameter.
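
   For reference, a minimal sketch of the type detection described above: passing a data frame to `Array$create()` should produce a struct array with one struct per row. The column values (`"cf"`, `"filter_id"`, `"user_id"`) and the names `cells_df` and `struct_arr` are illustrative assumptions, and plain strings stand in for the binary fields.

   ```R
   library(arrow)
   library(tibble)

   # A data frame maps to a struct type: each row becomes one struct value
   cells_df <- tibble(
     family = c("cf", "cf"),                # hypothetical column family name
     qualifier = c("filter_id", "user_id")  # strings used here instead of binary
   )
   struct_arr <- Array$create(cells_df)

   struct_arr$type     # a struct type with fields `family` and `qualifier`
   length(struct_arr)  # one element per row of cells_df
   ```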
   
   However, I have not been able to get the content I pass to `Array$create()` properly interpreted as a `StructArray`: when I convert it to a tibble, it becomes two rows instead of a single row with two columns. I have tried `Scalar` and `ChunkedArray` in different combinations, without success.
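
   One possible way to get a single row, sketched under assumptions: put both cells into one two-row data frame and wrap that data frame in `list()`, so the list column has length 1 rather than 2. The names `cells_one_row`, `data_one_row`, and `tab_one_row` and the value `"cf"` are hypothetical. Plain strings stand in for the binary fields and the timestamp field is omitted, so this does not attempt to match the full `bigtablerow` schema.

   ```R
   library(arrow)
   library(tibble)

   # Both cells of one Bigtable row live in a single two-row data frame
   cells_one_row <- tibble(
     family = c("cf", "cf"),  # hypothetical column family name
     qualifier = c("filter_id", "user_id"),
     value = c("0cd714fd-f6e8-4b76-aa16-1655b83e6148", "1655")
   )

   # list() around ONE data frame gives a list column of length 1,
   # so the outer tibble has a single row
   data_one_row <- tibble(key = "1", cells = list(cells_one_row))

   tab_one_row <- arrow_table(data_one_row)
   tab_one_row$num_rows  # row count of the resulting Arrow table
   tab_one_row$schema    # `cells` is inferred as a list of structs
   ```

   The difference from `list(tibble(), tibble())` is that each element of the outer list becomes one row of the list column, while the rows inside each data frame become the structs within that row.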
   
   Consequently, I would like to request an example of how to define a `StructArray` from R, along with any information you consider relevant for building Parquet files from R. In addition to the official documentation, I have followed in great detail the articles on Danielle Navarro's blog, such as https://blog.djnavarro.net/posts/2022-05-25_arrays-and-tables-in-arrow/, without finding examples of building a `StructArray`.
   


