rymurr commented on pull request #1314:
URL: https://github.com/apache/iceberg/pull/1314#issuecomment-678798042
Regarding Arrow's dataset API, here is a **very** rough example of what I think we
can do.
The filters from this PR, partition filtering, different filesystem types,
etc. are all handled here: partitions by Iceberg metadata and the rest by pyarrow.
What do you guys think?
``` python
from iceberg.hive import HiveTables
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

if __name__ == '__main__':
    conf = {"hive.metastore.uris": 'thrift://localhost:9083',
            'hive.metastore.warehouse.dir':
                '/home/ryan/warehouse/iceberg/hive_test'}
    tables = HiveTables(conf)
    tbl = tables.load("testing.foo")

    # inspect metadata
    print(tbl.schema())
    print(tbl.spec())
    print(int(tbl.current_snapshot().summary.get("total-records")))

    # let Iceberg plan the scan from table metadata
    scan = tbl.new_scan() \
        .filter("symbol==AUDCHF") \
        .select(["Bid", "Ask", "DateTime"])
    projection = scan.schema

    for task in scan.plan_tasks():
        # hand the planned files to pyarrow; strip the "file:" scheme for the local filesystem
        paths = [i.file._file_path.replace("file:", "") for i in task.files]
        dataset = ds.FileSystemDataset.from_paths(
            paths,
            schema=pa.schema([("Bid", pa.float64()),
                              ("Ask", pa.float64()),
                              ("DateTime", pa.timestamp("us", 'UTC'))]),
            format=ds.ParquetFileFormat(),
            filesystem=fs.LocalFileSystem())
        # row-level filtering is done by pyarrow
        pytbl = dataset.to_table(filter=ds.field("Bid") > 0.75)
        df = pytbl.to_pandas()
        print(df)
```
Output
```
table {
  1: DateTime: optional timestamptz(None)
  2: Bid: optional double(None)
  3: Ask: optional double(None)
  4: symbol: optional string(None)
}
[
  1000: DateTime_day: day(1)
]
846035
            Bid      Ask                         DateTime
0       0.75935  0.76156 2018-01-01 21:58:33.821000+00:00
1       0.75940  0.76155 2018-01-01 21:58:34.821000+00:00
2       0.75943  0.76154 2018-01-01 21:58:35.733000+00:00
3       0.75945  0.76153 2018-01-01 21:58:36.734000+00:00
4       0.75947  0.76152 2018-01-01 21:58:37.733000+00:00
...         ...      ...                              ...
388406  0.76612  0.76669 2018-01-05 21:59:00.070000+00:00
388407  0.76611  0.76669 2018-01-05 21:59:00.336000+00:00
388408  0.76611  0.76670 2018-01-05 21:59:00.809000+00:00
388409  0.76524  0.76747 2018-01-05 21:59:01.011000+00:00
388410  0.76534  0.76757 2018-01-05 21:59:12.367000+00:00

[388411 rows x 3 columns]
```
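To make the division of labor concrete, here is a minimal, library-free sketch of the two stages: pruning whole files from partition values, then filtering rows in the files that remain. Everything in it (`DataFile`, `plan_files`, `read_rows`, the file names and values) is hypothetical and only illustrates the idea. It is not the Iceberg or pyarrow API, and it assumes a day partition like the `DateTime_day` one in the spec printed above.

``` python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    day: date    # partition value for this file (assumed day partition)
    rows: list   # in-memory rows; a real reader would load Parquet here


def plan_files(files, start, end):
    """Stage 1: prune whole files using partition values only (no data is read)."""
    return [f for f in files if start <= f.day <= end]


def read_rows(files, predicate):
    """Stage 2: apply the row-level filter to the files that survived pruning."""
    for f in files:
        for row in f.rows:
            if predicate(row):
                yield row


# Made-up files and values, purely for illustration.
files = [
    DataFile("day=2018-01-01.parquet", date(2018, 1, 1), [{"Bid": 0.74}, {"Bid": 0.76}]),
    DataFile("day=2018-01-02.parquet", date(2018, 1, 2), [{"Bid": 0.80}]),
    DataFile("day=2018-02-01.parquet", date(2018, 2, 1), [{"Bid": 0.90}]),
]

# The metadata stage drops the February file before any rows are touched.
planned = plan_files(files, date(2018, 1, 1), date(2018, 1, 5))
# The reader stage then keeps only rows matching the row-level predicate.
result = list(read_rows(planned, lambda row: row["Bid"] > 0.75))

print([f.path for f in planned])
print(result)
```

In the example above, the first stage corresponds roughly to `scan.plan_tasks()` and the second to `dataset.to_table(filter=...)`.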