nyoungstudios commented on issue #25917:
URL: https://github.com/apache/beam/issues/25917#issuecomment-1548887069

   @liferoad Thanks for uploading the files. I checked some of them and noticed that the type of the "airport_fee" field doesn't look to be the union of `['null', 'long', 'double']`.
   
   Rather, it looks to be a record with two fields: the first of type long and the second of type double.
   
   I'm not sure of the best way to quickly print the top of the file in Beam, so I ran these commands in `spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.1` on Google Dataproc.
   
   ```scala
   // running these commands on a converted Avro file
   scala> val dfa = spark.read.format("avro").load("gs://apache-beam-samples/nyc_trip/avro/fhvhv_tripdata_2022-12.avro-00000-of-00001.avro")
   scala> dfa.printSchema
   root
    |-- hvfhs_license_num: string (nullable = true)
    |-- dispatching_base_num: string (nullable = true)
    |-- originating_base_num: string (nullable = true)
    |-- request_datetime: long (nullable = true)
    |-- on_scene_datetime: long (nullable = true)
    |-- pickup_datetime: long (nullable = true)
    |-- dropoff_datetime: long (nullable = true)
    |-- PULocationID: long (nullable = true)
    |-- DOLocationID: long (nullable = true)
    |-- trip_miles: double (nullable = true)
    |-- trip_time: long (nullable = true)
    |-- base_passenger_fare: double (nullable = true)
    |-- tolls: double (nullable = true)
    |-- bcf: double (nullable = true)
    |-- sales_tax: double (nullable = true)
    |-- congestion_surcharge: double (nullable = true)
    |-- airport_fee: struct (nullable = true)
    |    |-- member0: long (nullable = true)
    |    |-- member1: double (nullable = true)
    |-- tips: double (nullable = true)
    |-- driver_pay: double (nullable = true)
    |-- shared_request_flag: string (nullable = true)
    |-- shared_match_flag: string (nullable = true)
    |-- access_a_ride_flag: string (nullable = true)
    |-- wav_request_flag: string (nullable = true)
    |-- wav_match_flag: string (nullable = true)
   
   // running these commands on the Parquet file
   scala> val df = spark.read.parquet("gs://apache-beam-samples/nyc_trip/parquet/fhvhv_tripdata_2022-12.parquet")
   scala> df.printSchema
   root
    |-- hvfhs_license_num: string (nullable = true)
    |-- dispatching_base_num: string (nullable = true)
    |-- originating_base_num: string (nullable = true)
    |-- request_datetime: timestamp (nullable = true)
    |-- on_scene_datetime: timestamp (nullable = true)
    |-- pickup_datetime: timestamp (nullable = true)
    |-- dropoff_datetime: timestamp (nullable = true)
    |-- PULocationID: long (nullable = true)
    |-- DOLocationID: long (nullable = true)
    |-- trip_miles: double (nullable = true)
    |-- trip_time: long (nullable = true)
    |-- base_passenger_fare: double (nullable = true)
    |-- tolls: double (nullable = true)
    |-- bcf: double (nullable = true)
    |-- sales_tax: double (nullable = true)
    |-- congestion_surcharge: double (nullable = true)
    |-- airport_fee: double (nullable = true)
    |-- tips: double (nullable = true)
    |-- driver_pay: double (nullable = true)
    |-- shared_request_flag: string (nullable = true)
    |-- shared_match_flag: string (nullable = true)
    |-- access_a_ride_flag: string (nullable = true)
    |-- wav_request_flag: string (nullable = true)
    |-- wav_match_flag: string (nullable = true)
   ```
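
   If downstream code wants a single nullable double for "airport_fee", as in the Parquet schema, the struct that Spark reports for the Avro file could probably be collapsed with `coalesce`. This is only a rough sketch for `spark-shell`: the `sample` DataFrame is a hypothetical stand-in that mimics the `member0`/`member1` layout printed above, not the real data, and it assumes at most one member is set per row. It says nothing about how the field is actually declared in the Avro file's own schema.

   ```scala
   // Intended for spark-shell, where `spark` is already defined.
   import org.apache.spark.sql.functions.{coalesce, col, struct}
   import spark.implicits._

   // Hypothetical stand-in for the airport_fee struct seen in the Avro read:
   // member0 is the long member, member1 is the double member.
   val sample = Seq[(Option[Long], Option[Double])](
     (Some(2L), None),
     (None, Some(2.5)),
     (None, None)
   ).toDF("member0", "member1")
     .select(struct(col("member0"), col("member1")).as("airport_fee"))

   // Prefer the double member; otherwise fall back to the long member cast to double.
   // If both members are null the result is null.
   val flattened = sample.withColumn(
     "airport_fee",
     coalesce(col("airport_fee.member1"), col("airport_fee.member0").cast("double"))
   )

   flattened.printSchema
   flattened.show()
   ```

   The same `withColumn` expression should apply to `dfa` from the transcript above, assuming its struct really has the two members shown.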
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
