Posmac commented on issue #5204:
URL: https://github.com/apache/datafusion/issues/5204#issuecomment-4689755710
Got interesting results:
**Prerequisites:**
I used this dataset (`invalid_utf8.csv`):
```
id,name,age,country
1,Nicolai,22,Moldova
2,Bob,68,Sweden�������������
3,Charlie,45,USA
```
The `country` value in row 2 is followed by 13 bytes that are invalid in UTF-8:
```
pub const INVALID_UTF8: [u8; 13] = [
    0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE,
    0xFF,
];
```
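To make the fixture reproducible, here is a minimal std-only sketch that builds the file byte by byte. The helper names `build_invalid_csv` and `write_invalid_csv` are hypothetical, and the trailing newline after the last row is an assumption about the original file.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// The 13 bytes listed above; none of them can occur in well-formed UTF-8.
pub const INVALID_UTF8: [u8; 13] = [
    0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE,
    0xFF,
];

/// Builds the CSV contents in memory, appending the invalid bytes to the
/// `country` value of row 2 (hypothetical helper).
pub fn build_invalid_csv() -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"id,name,age,country\n");
    bytes.extend_from_slice(b"1,Nicolai,22,Moldova\n");
    bytes.extend_from_slice(b"2,Bob,68,Sweden");
    bytes.extend_from_slice(&INVALID_UTF8);
    // Assumption: the file ends with a newline after the last row.
    bytes.extend_from_slice(b"\n3,Charlie,45,USA\n");
    bytes
}

/// Writes the fixture to `path` (hypothetical helper).
pub fn write_invalid_csv(path: &Path) -> io::Result<()> {
    fs::write(path, build_invalid_csv())
}
```

Calling `write_invalid_csv(Path::new("invalid_utf8.csv"))` from a small `main` should produce a file equivalent to the one used in the runs below. Writing raw bytes avoids editors or shells silently re-encoding the invalid sequence.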
-------------------------------------------------------
**DataFusion (separate project)**:
Project dependencies:
[dependencies]
datafusion = "54.0.0"
tokio = { version = "1.52.3", features = ["full"] }
```
use datafusion::{
    error::Result,
    execution::{context::SessionContext, options::CsvReadOptions},
};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the CSV file with the invalid UTF-8 bytes as table `invalid`.
    ctx.register_csv("invalid", "invalid_utf8.csv", CsvReadOptions::new())
        .await?;
    // Read every row back and print the result as a table.
    let df = ctx.sql("SELECT * from invalid").await?;
    df.show().await?;
    Ok(())
}
```
**Run**
`cargo run`
**Result**:
```
Finished `release` profile [optimized] target(s) in 0.39s
Running `target/release/datafusion_sandbox`
+----+---------+-----+---------------------+
| id | name    | age | country             |
+----+---------+-----+---------------------+
| 1  | Alice   | 22  | Moldova             |
| 2  | Bob     | 68  | Sweden������������� |
| 3  | Charlie | 45  | USA                 |
+----+---------+-----+---------------------+
```
-------------------------------------------------------
**DataFusion (inside datafusion-examples)**:
Create a `sandbox` folder at `datafusion-examples/examples/sandbox` and put the
dataset at `datafusion-examples/data/csv/invalid_utf8.csv`.
```
use datafusion::{
    error::Result,
    execution::{context::SessionContext, options::CsvReadOptions},
};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Same query as before; only the path to the dataset differs.
    ctx.register_csv(
        "invalid",
        "datafusion-examples/data/csv/invalid_utf8.csv",
        CsvReadOptions::new(),
    )
    .await?;
    let df = ctx.sql("SELECT * from invalid").await?;
    df.show().await?;
    Ok(())
}
```
**Run (from the root of the datafusion project)**:
`cargo run --example sandbox`
**Result**:
`Error: Context("Error when processing CSV file
Users/nicolaiposmac/Work/datafusion/datafusion-examples/data/csv/invalid_utf8.csv",
ArrowError(CsvError("Encountered UTF-8 error while reading CSV file: invalid
utf-8: invalid UTF-8 in field 3 near byte index 6 at line 3"), Some("")))`
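For comparison, the two observed behaviours (replacement characters in one run, a hard error in the other) correspond to lossy and strict decoding in Rust's standard library. The sketch below is std-only and is not the code path DataFusion or arrow actually takes; `decode_strict`, `decode_lossy`, and `sweden_field` are hypothetical names used for illustration.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Strict decoding: fails if any byte in the field is not valid UTF-8.
pub fn decode_strict(field: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(field)
}

/// Lossy decoding: replaces each invalid sequence with U+FFFD.
pub fn decode_lossy(field: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(field)
}

/// The `country` field of row 2: "Sweden" followed by the 13 invalid bytes.
pub fn sweden_field() -> Vec<u8> {
    let mut field = b"Sweden".to_vec();
    field.extend_from_slice(&[
        0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD,
        0xFE, 0xFF,
    ]);
    field
}
```

On this field, `decode_strict` returns an error whose `valid_up_to()` is 6, which lines up with the "byte index 6" in the message above, while `decode_lossy` yields `Sweden` followed by 13 replacement characters, as in the table output of the other runs.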
-------------------------------------------------------
**Spark (PySpark)**:
```
from pyspark.sql import SparkSession

# Local Spark session bound to the loopback address.
spark = SparkSession.builder \
    .appName("DF_vs_Spark_Test") \
    .master("local[*]") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()

print("Reading CSV using Spark...")
df = spark.read.csv("invalid_utf8.csv", header=True)
df.show()

# Collect the rows to confirm they can also be read on the driver.
for row in df.collect():
    print(f"Row data: id={row['id']}, name={row['name']}")

spark.stop()
```
**Run**:
`python3 spark_test.py`
**Result**:
```
Reading CSV using Spark...
+---+-------+---+-------------------+
| id|   name|age|            country|
+---+-------+---+-------------------+
|  1|  Alice| 22|            Moldova|
|  2|    Bob| 68|Sweden�������������|
|  3|Charlie| 45|                USA|
+---+-------+---+-------------------+
```