wayneguow commented on code in PR #47506:
URL: https://github.com/apache/spark/pull/47506#discussion_r1695499514


##########
docs/sql-migration-guide.md:
##########
@@ -627,7 +627,7 @@ license: |
 
 ## Upgrading from Spark SQL 2.2 to 2.3
 
-  - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
the referenced columns only include the internal corrupt record column (named 
`_corrupt_record` by default). For example, 
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
 and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. 
Instead, you can cache or save the parsed results and then send the same query. 
For example, `val df = spark.read.schema(schema).json(file).cache()` and then 
`df.filter($"_corrupt_record".isNotNull).count()`.
+  - Since Spark 2.3, the queries from raw JSON files are disallowed when the 
referenced columns only include the internal corrupt record column (named 
`_corrupt_record` by default). For example, 
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
 and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. 
Instead, you can cache or save the parsed results and then send the same query. 
For example, `val df = spark.read.schema(schema).json(file).cache()` and then 
`df.filter($"_corrupt_record".isNotNull).count()`.

Review Comment:
   After doing some investigation into the change history of `CSVFileFormat`, I found that there was indeed a relevant PR (#19199) for CSV before, but the check was removed in PR #35817. That removal seems to have been made to solve a filter push-down problem, but I don't know why the previous code related to `requiredSchema` was removed along with it.
   
   I also confirmed that, with the current code, if you select only `columnNameOfCorruptRecord`, the results are all null, which is inappropriate.
   <img width="1030" alt="image" src="https://github.com/user-attachments/assets/f23a3e5e-7d59-4716-ac50-05ca7e37624e">
   
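   A minimal sketch of the behavior described above (assuming a local `SparkSession` and a hypothetical malformed CSV file `people.csv`; the file name and schema are illustrative, not from the PR):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.{StringType, StructField, StructType}
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
   // User-provided schema that includes the internal corrupt record column.
   val schema = StructType(Seq(
     StructField("age", StringType),
     StructField("_corrupt_record", StringType)
   ))
   
   // Selecting only `_corrupt_record` from a raw CSV: with the current code this
   // shows all nulls instead of the malformed lines, whereas the JSON source
   // throws an AnalysisException for the equivalent query since Spark 2.3.
   spark.read
     .schema(schema)
     .csv("people.csv") // hypothetical malformed input
     .select("_corrupt_record")
     .show()
   ```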
   
   So I think we'd better restore the previous detection code and throw the relevant exceptions. (I don't know if I have missed something.) WDYT? @HyukjinKwon 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

