[I] Query on nested struct field with PyIceberg? [iceberg-python]

via GitHub Mon, 22 Jul 2024 05:03:31 -0700


cfrancois7 opened a new issue, #953:
URL: https://github.com/apache/iceberg-python/issues/953


   ### Question
   
   I'm looking for a tutorial to make a query on one subfield of a struct field.
   I scrolled all internet but failed to find a way to do it simply with 
pyiceberg.
   
   To make it concret, for instance how to get the row with `"employment.status 
= 'Employed'"` : 
   
   ```python
   [{'id': 1,
     'name': 'Alice',
     'age': 28,
     'address': {'street': '123 Maple St',
      'city': 'Springfield',
      'postal_code': '12345'},
     'contact': {'email': '[email protected]', 'phone': '555-1234'},
     'employment': {'status': 'Employed',
      'position': 'Software Engineer',
      'company': {'name': 'Tech Corp', 'location': 'Silicon Valley'}},
     'preferences': {'newsletter': True,
      'notifications': {'email': True, 'sms': False}}},
    {'id': 2,
     'name': 'Bob',
     'age': 35,
     'address': {'street': '456 Oak St',
      'city': 'Metropolis',
      'postal_code': '67890'},
     'contact': {'email': '[email protected]', 'phone': '555-5678'},
     'employment': {'status': 'Self-employed',
      'position': 'Consultant',
      'company': {'name': 'Freelance', 'location': 'Remote'}},
     'preferences': {'newsletter': False,
      'notifications': {'email': True, 'sms': True}}}]
   ```
      
   With the following schema:
   ```
    import pyarrow as pa
    
    schema = pa.schema([
     ('id', pa.int32()),
     ('name', pa.string()),
     ('age', pa.int32()),
     ('address', pa.struct([
         ('street', pa.string()),
         ('city', pa.string()),
         ('postal_code', pa.string())
     ])),
     ('contact', pa.struct([
         ('email', pa.string()),
         ('phone', pa.string())
     ])),
     ('employment', pa.struct([
         pa.field('status', pa.string(), nullable=True),
         pa.field('position', pa.string(), nullable=True),
         pa.field('company', pa.struct([
             ('name', pa.string()),
             ('location', pa.string())
         ]), nullable=True)
     ])),
     ('preferences', pa.struct([
         ('newsletter', pa.bool_()),
         ('notifications', pa.struct([
             ('email', pa.bool_()),
             ('sms', pa.bool_())
         ]))
     ]))
   ])
   ```
   
   I tried this kind of query, but without success:
   
   ```python
   row_filter = "employment.status = 'Employed'"
   
   table.scan(
       row_filter=row_filter,
       selected_fields=["age", "employment", 'contact.email']
   ).to_pandas()
   ```
   The command raises the error:
   ```Python
   ValueError: Could not find field with name status, case_sensitive=True
   ```
   
   The backend is supported by SQLite.
   
   versions:
   ```bash
   $ pip list | grep 'iceberg\|arrow\|sqlite'
   arrow                     1.3.0
   pyarrow                   15.0.2
   pyiceberg                 0.6.1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Query on nested struct field with PyIceberg? [iceberg-python]

Reply via email to