[GitHub] [arrow-datafusion] sarutak opened a new pull request, #7525: Make AvroArrowArrayReader possible to scan Avro backed table which contains nested records

via GitHub Mon, 11 Sep 2023 12:28:04 -0700


sarutak opened a new pull request, #7525:
URL: https://github.com/apache/arrow-datafusion/pull/7525


   ## Which issue does this PR close?
   
   Closes #7524 
   
   ## Rationale for this change
   This PR fixes the issue described in #7524.
   
   ## What changes are included in this PR?
   The issue has two causes:
   
   1. `schema_lookup` considers the `lookup` table only for the root record. Child records have their own `lookup` tables, so those should be considered too.
   2. The logic for reading arrays of records is wrong.
   
   This change fixes both.
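
The first cause can be illustrated with a minimal Python sketch. It is not the DataFusion implementation: the function `build_lookup`, the dict-based schema, and the assumption that lookup keys are namespace-qualified field names (as the column names in the query output suggest) are all illustrative. The reduced schema writes the array field in the nested `{"type": "array", "items": ...}` form.

```python
def build_lookup(schema, lookup=None):
    """Map qualified field names to their position within the owning record.

    Hypothetical helper: it recurses into child records and array items
    instead of stopping at the root record.
    """
    if lookup is None:
        lookup = {}
    if not isinstance(schema, dict):
        return lookup  # primitive type name such as "string"
    if schema.get("type") == "record":
        parts = (schema.get("namespace"), schema["name"])
        prefix = ".".join(p for p in parts if p)
        for position, field in enumerate(schema["fields"]):
            lookup[f"{prefix}.{field['name']}"] = position
            build_lookup(field["type"], lookup)  # descend into child records
    elif schema.get("type") == "array":
        build_lookup(schema["items"], lookup)  # records inside arrays count too
    return lookup


# Reduced, hypothetical schema modelled on the one used in this PR.
schema = {
    "name": "record1", "namespace": "ns1", "type": "record",
    "fields": [
        {"name": "f1", "type": {
            "name": "record2", "namespace": "ns2", "type": "record",
            "fields": [{"name": "f1_1", "type": "string"}]}},
        {"name": "f2", "type": {"type": "array", "items": {
            "name": "record4", "namespace": "ns4", "type": "record",
            "fields": [{"name": "f2_1", "type": "boolean"}]}}},
    ],
}

lookup = build_lookup(schema)
```

Walking only the root record would yield just `ns1.record1.f1` and `ns1.record1.f2`; the recursion is what picks up the fields of the child records.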
   
   ## Are these changes tested?
   I prepared [this Avro file](https://github.com/sarutak/arrow-testing/blob/nested-records/data/avro/nested_records.avro) for testing.
   The schema of this file is as follows.
   ```
   {
       "name": "record1",
       "namespace": "ns1",
       "type": "record",
       "fields": [
           {
               "name": "f1",
               "type": {
                   "name": "record2",
                   "namespace": "ns2",
                   "type": "record",
                   "fields": [
                       {
                           "name": "f1_1",
                           "type": "string"
                       },  {
                           "name": "f1_2",
                           "type": "int"
                       },  {
                           "name": "f1_3",
                           "type": {
                               "name": "record3",
                               "namespace": "ns3",
                               "type": "record",
                               "fields": [
                                   {
                                       "name": "f1_3_1",
                                       "type": "double"
                                   }
                               ]
                           }
                       }
                   ]
               }
           },  {
               "name": "f2",
               "type": "array",
               "items": {
                   "name": "record4",
                   "namespace": "ns4",
                   "type": "record",
                   "fields": [
                       {
                           "name": "f2_1",
                           "type": "boolean"
                       },  {
                           "name": "f2_2",
                           "type": "float"
                       }
                   ]
               }
           }
       ]
   }
   ```
   
   The JSON representation of the file's contents is as follows.
   ```
   {"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2},{"f2_1":true,"f2_2":2.2}]}
   {"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.2}]}
   ```
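
To show what reading an array of records into columnar form involves, here is a minimal Python sketch over the two rows above. It mirrors the offsets-plus-child-values layout that Arrow uses for list arrays, but it is not DataFusion's code; `to_list_of_structs` is a hypothetical helper.

```python
import json

# The two rows from the JSON representation of the test file.
rows = [json.loads(line) for line in (
    '{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},'
    '"f2":[{"f2_1":true,"f2_2":1.2},{"f2_1":true,"f2_2":2.2}]}',
    '{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},'
    '"f2":[{"f2_1":false,"f2_2":10.2}]}',
)]


def to_list_of_structs(rows, field, child_names):
    """Flatten a list-of-records column into offsets plus one column per child."""
    offsets = [0]
    children = {name: [] for name in child_names}
    for row in rows:
        items = row.get(field) or []
        for item in items:
            for name in child_names:
                children[name].append(item.get(name))
        # Each offset marks where the next row's items start in the child columns.
        offsets.append(offsets[-1] + len(items))
    return offsets, children


offsets, children = to_list_of_structs(rows, "f2", ["f2_1", "f2_2"])
```

Here the first row owns child entries 0 and 1 and the second row owns entry 2, so the offsets come out as `[0, 2, 3]`.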
   
   Using this data, I created a table and scanned it.
   ```
   CREATE EXTERNAL TABLE mytbl STORED AS AVRO LOCATION '/path/to/nested_records.avro';
   SELECT * FROM mytbl;
   
   +---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   | f1                                                                                          | f2                                                                                                 |
   +---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   | {ns2.record2.f1_1: aaa, ns2.record2.f1_2: 10, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: true, ns4.record4.f2_2: 1.2}, {ns4.record4.f2_1: true, ns4.record4.f2_2: 2.2}] |
   | {ns2.record2.f1_1: bbb, ns2.record2.f1_2: 20, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: false, ns4.record4.f2_2: 10.2}]                                                |
   +---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.006 seconds.
   ```
   
   The result is as expected.
   
   If this change is merged, I'll open a PR to add the test data to [arrow-testing](https://github.com/apache/arrow-testing). Then I'll open a follow-up PR to add tests to [avro.slt](https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/avro.slt).
   
   ## Are there any user-facing changes?
   No.

