nyoungstudios commented on issue #25917:
URL: https://github.com/apache/beam/issues/25917#issuecomment-1548887069
@liferoad Thanks for uploading the files. I checked some of them and
noticed that the type of the "airport_fee" field doesn't appear to be
the union `['null', 'long', 'double']`.
Rather, it looks like a record with two fields, the first of
type long and the second of type double.
I'm not sure of the best way to quickly print the top of a file in Beam, so I
ran these commands in
`spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.1` on Google Dataproc.
```scala
// running these commands on a converted avro file
scala> val dfa = spark.read.format("avro").load("gs://apache-beam-samples/nyc_trip/avro/fhvhv_tripdata_2022-12.avro-00000-of-00001.avro")
scala> dfa.printSchema
root
|-- hvfhs_license_num: string (nullable = true)
|-- dispatching_base_num: string (nullable = true)
|-- originating_base_num: string (nullable = true)
|-- request_datetime: long (nullable = true)
|-- on_scene_datetime: long (nullable = true)
|-- pickup_datetime: long (nullable = true)
|-- dropoff_datetime: long (nullable = true)
|-- PULocationID: long (nullable = true)
|-- DOLocationID: long (nullable = true)
|-- trip_miles: double (nullable = true)
|-- trip_time: long (nullable = true)
|-- base_passenger_fare: double (nullable = true)
|-- tolls: double (nullable = true)
|-- bcf: double (nullable = true)
|-- sales_tax: double (nullable = true)
|-- congestion_surcharge: double (nullable = true)
|-- airport_fee: struct (nullable = true)
| |-- member0: long (nullable = true)
| |-- member1: double (nullable = true)
|-- tips: double (nullable = true)
|-- driver_pay: double (nullable = true)
|-- shared_request_flag: string (nullable = true)
|-- shared_match_flag: string (nullable = true)
|-- access_a_ride_flag: string (nullable = true)
|-- wav_request_flag: string (nullable = true)
|-- wav_match_flag: string (nullable = true)
// running these commands on the parquet file
scala> val df = spark.read.parquet("gs://apache-beam-samples/nyc_trip/parquet/fhvhv_tripdata_2022-12.parquet")
scala> df.printSchema
root
|-- hvfhs_license_num: string (nullable = true)
|-- dispatching_base_num: string (nullable = true)
|-- originating_base_num: string (nullable = true)
|-- request_datetime: timestamp (nullable = true)
|-- on_scene_datetime: timestamp (nullable = true)
|-- pickup_datetime: timestamp (nullable = true)
|-- dropoff_datetime: timestamp (nullable = true)
|-- PULocationID: long (nullable = true)
|-- DOLocationID: long (nullable = true)
|-- trip_miles: double (nullable = true)
|-- trip_time: long (nullable = true)
|-- base_passenger_fare: double (nullable = true)
|-- tolls: double (nullable = true)
|-- bcf: double (nullable = true)
|-- sales_tax: double (nullable = true)
|-- congestion_surcharge: double (nullable = true)
|-- airport_fee: double (nullable = true)
|-- tips: double (nullable = true)
|-- driver_pay: double (nullable = true)
|-- shared_request_flag: string (nullable = true)
|-- shared_match_flag: string (nullable = true)
|-- access_a_ride_flag: string (nullable = true)
|-- wav_request_flag: string (nullable = true)
|-- wav_match_flag: string (nullable = true)
```
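One caveat on reading the output above: the spark-avro documentation says that Avro unions other than the simple cases (a type plus null, int with long, float with double) are read as a struct whose fields are named `member0`, `member1`, and so on. So the struct may simply be how Spark renders a `['null', 'long', 'double']` union, not evidence of a record in the file. Below is a minimal sketch to check that, assuming a spark-shell started with the spark-avro package as above. The record name `Trip`, the temp file, the sample values, and the helper `flattenAirportFee` are all made up for illustration and are not part of the Beam samples.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}
import org.apache.spark.sql.types.DoubleType

// In spark-shell this returns the existing session; elsewhere it starts a local one.
val spark = SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()

// Hypothetical one-field schema whose type is the union under discussion.
val avroSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Trip","fields":[
    |  {"name":"airport_fee","type":["null","long","double"],"default":null}
    |]}""".stripMargin)

// Write three records, one per union branch (long, double, null), to a temp file.
val file = new File(Files.createTempDirectory("union-sketch").toFile, "trips.avro")
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](avroSchema))
writer.create(avroSchema, file)
Seq[AnyRef](java.lang.Long.valueOf(2L), java.lang.Double.valueOf(2.5), null).foreach { v =>
  val rec = new GenericData.Record(avroSchema)
  rec.put("airport_fee", v)
  writer.append(rec)
}
writer.close()

// Read the file back with spark-avro and look at how the union is rendered.
val unionDf = spark.read.format("avro").load(file.getAbsolutePath)
unionDf.printSchema()

// Hypothetical helper: collapse the two branches into a single nullable double,
// preferring the double branch and falling back to the long branch cast to double.
def flattenAirportFee(df: DataFrame): DataFrame =
  df.withColumn("airport_fee",
    coalesce(col("airport_fee.member1"), col("airport_fee.member0").cast(DoubleType)))

val flat = flattenAirportFee(unionDf)
flat.printSchema()
```

If the first `printSchema` shows the same `member0`/`member1` struct as the converted file, the field in the converted file may well be the expected union. The `flattenAirportFee` step is just one way to get back to a plain double column that matches the parquet schema.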