rymurr commented on pull request #1314:
URL: https://github.com/apache/iceberg/pull/1314#issuecomment-678798042


   Regarding Arrow's dataset API, here is a **very** rough example of what I think we can do.
   
   The filters from this PR, partition filtering, different filesystem types, etc. are all handled here: partition filtering by Iceberg metadata and the rest by pyarrow. What do you guys think?
   
   ``` python
   from iceberg.hive import HiveTables
   import pyarrow as pa
   import pyarrow.dataset as ds
   from pyarrow import fs
   
   if __name__ == '__main__':
   
       conf = {"hive.metastore.uris": 'thrift://localhost:9083',
               'hive.metastore.warehouse.dir': '/home/ryan/warehouse/iceberg/hive_test'}
       tables = HiveTables(conf)
   
       tbl = tables.load("testing.foo")
   
       # inspect metadata
       print(tbl.schema())
       print(tbl.spec())
       print(int(tbl.current_snapshot().summary.get("total-records")))
   
       scan = tbl.new_scan() \
           .filter("symbol==AUDCHF") \
           .select(["Bid", "Ask", "Datetime"])
   
       projection = scan.schema
       for task in scan.plan_tasks():
           dataset = ds.FileSystemDataset.from_paths(
               [i.file._file_path.replace("file:", "") for i in task.files],
               schema=pa.schema([("Bid", pa.float64()),
                                 ("Ask", pa.float64()),
                                 ("DateTime", pa.timestamp("us", 'UTC'))]),
               format=ds.ParquetFileFormat(),
               filesystem=fs.LocalFileSystem())
           pytbl = dataset.to_table(filter=ds.field("Bid") > 0.75)
           df = pytbl.to_pandas()
           print(df)
   ```
   
   Output
   ```
   table {
    1: DateTime: optional timestamptz(None)
    2: Bid: optional double(None)
    3: Ask: optional double(None)
    4: symbol: optional string(None)
   }
   [
    1000: DateTime_day: day(1)
   ]
   846035
               Bid      Ask                         DateTime
   0       0.75935  0.76156 2018-01-01 21:58:33.821000+00:00
   1       0.75940  0.76155 2018-01-01 21:58:34.821000+00:00
   2       0.75943  0.76154 2018-01-01 21:58:35.733000+00:00
   3       0.75945  0.76153 2018-01-01 21:58:36.734000+00:00
   4       0.75947  0.76152 2018-01-01 21:58:37.733000+00:00
   ...         ...      ...                              ...
   388406  0.76612  0.76669 2018-01-05 21:59:00.070000+00:00
   388407  0.76611  0.76669 2018-01-05 21:59:00.336000+00:00
   388408  0.76611  0.76670 2018-01-05 21:59:00.809000+00:00
   388409  0.76524  0.76747 2018-01-05 21:59:01.011000+00:00
   388410  0.76534  0.76757 2018-01-05 21:59:12.367000+00:00
   
   [388411 rows x 3 columns]
   ```
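
To make the split of responsibilities a bit more concrete, here is a toy, stdlib-only sketch of the idea. It makes no Iceberg or pyarrow calls: `DataFile`, `plan_files`, and `read_rows` are hypothetical names invented for illustration, and the in-memory rows merely stand in for Parquet contents. Files are first pruned using only their partition value, then the remaining row-level predicate and column projection are applied while reading.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    day: date    # partition value, analogous to DateTime_day
    rows: list   # in-memory rows standing in for the file contents


def plan_files(files, start, end):
    """Metadata-only step: keep files whose partition day is in [start, end]."""
    return [f for f in files if start <= f.day <= end]


def read_rows(files, predicate, columns):
    """Reader step: apply the row-level predicate and project columns."""
    for f in files:
        for row in f.rows:
            if predicate(row):
                yield {c: row[c] for c in columns}


files = [
    DataFile("a.parquet", date(2018, 1, 1),
             [{"Bid": 0.74, "Ask": 0.75}, {"Bid": 0.76, "Ask": 0.77}]),
    DataFile("b.parquet", date(2018, 1, 2), [{"Bid": 0.80, "Ask": 0.81}]),
    DataFile("c.parquet", date(2018, 2, 1), [{"Bid": 0.90, "Ask": 0.91}]),
]

# Partition pruning never opens a file; c.parquet is skipped entirely.
planned = plan_files(files, date(2018, 1, 1), date(2018, 1, 5))

# The row filter and projection only touch the files that survived planning.
result = list(read_rows(planned, lambda r: r["Bid"] > 0.75, ["Bid"]))

print([f.path for f in planned])
print(result)
```

In the rough example above, the first step corresponds to `scan.plan_tasks()` and the second to `dataset.to_table(filter=...)`.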




