litao3rd commented on issue #37840:
URL: https://github.com/apache/arrow/issues/37840#issuecomment-1732365861
Sorry for my mistake.
I am currently using version 12.0.1 of Arrow.
To simplify the process, I have created a small script that downloads the few
files needed to reproduce this issue. The script creates a directory called
"tlc-trip-record-data" in the current directory and downloads the data
into it. Please note that we are behind the Great Firewall, so you
may need to run the script through a proxy. Any extra arguments are passed
straight to curl, and the script uses bash builtins (pushd/popd), so run it
with bash: bash ./download-tcl-trip-record-data.sh --proxy protocol://host:port.
I have encountered a perplexing issue. When I use 6 months of data, the
first block returns a total of 142,516,648 rows, while the other block returns
0 rows. However, when I use 5 months of data, excluding January, both blocks
yield the same result. Unfortunately, I am unable to identify the bug in Arrow
due to its complexity.
Please note that you need to modify the path to the tlc-trip-record-data
directory in the C++ code.
``` download-tcl-trip-record-data.sh
#!/bin/bash
set -e
dataset="tlc-trip-record-data"
# Create the target directory if it does not exist yet, then work inside it.
test -d "$dataset" || mkdir "$dataset"
pushd "$dataset"
for n in $(seq 1 6); do
    d="2023-0$n"
    echo "[*] downloading data for $d"
    # "$@" forwards any extra arguments (e.g. --proxy protocol://host:port) to curl.
    curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_$d.parquet"
    curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_$d.parquet"
    curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_$d.parquet"
    curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_$d.parquet"
    echo "[*] finished downloading data for $d"
done
popd
```
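Since the row counts depend on which files are present, it may help to confirm that all 24 files (6 months x 4 trip types) were downloaded before running the reproduction. The following is only a sketch: `missing_files` is a hypothetical helper that is not part of the script above, and it assumes the directory name and file naming used there.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print every expected
# parquet file that is missing or empty in the given directory.
# Assumes the naming scheme <kind>_tripdata_2023-0<n>.parquet used above.
missing_files() {
    local dir="${1:-tlc-trip-record-data}"
    local n kind f
    for n in $(seq 1 6); do
        for kind in yellow green fhv fhvhv; do
            f="$dir/${kind}_tripdata_2023-0$n.parquet"
            # -s succeeds only if the file exists and has a size greater than zero.
            test -s "$f" || echo "$f"
        done
    done
}
```

Calling `missing_files ./tlc-trip-record-data` prints nothing when every file is present. Note that this only checks for existence and non-zero size, so a truncated download would still pass.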
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.