eugenegujing opened a new issue, #5935:
URL: https://github.com/apache/texera/issues/5935
### What happened?
When a CSV column holds integer-looking values but also contains missing
values, the workflow crashes inside any pandas-based Python operator (e.g.
Sort).
Root cause chain:
1. CSV File Scan auto-infers such a column as `integer`
(`inferSchemaFromRows` in `AttributeTypeUtils.scala`), and there is no
per-column type override in the UI.
2. Python operators run on pandas. With its default NumPy-backed dtypes, pandas
up-casts an integer column that contains any NaN to float64, because an int64
column cannot hold NaN. So `121` becomes `121.0`.
3. On output, the Python worker validates each tuple against the declared
schema, which still says INTEGER. The actual value is a float, so it raises:
```
TypeError: Unmatched type for field 'weight', expected AttributeType.INT,
got 119.0 (<class 'float'>) instead.
File ".../core/models/tuple.py", line 361, in validate_schema (called
from finalize -> on_finish)
```
This affects every integer column that also has missing values: the user
must hit the error, find the column, and manually cast it. In our dataset
(diabetes.csv) 11 columns are affected (weight, chol, hdl, height, bp.1s,
bp.1d, bp.2s, bp.2d, waist, hip, time.ppn).
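
The up-cast in step 2 can be seen with plain pandas, outside Texera. This is a minimal sketch: the two-row CSV is a made-up excerpt, and it uses `pd.read_csv` only to demonstrate the dtype behavior, not the code path the Python worker actually takes.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt: the second row has an empty `weight` cell.
csv_text = "age,weight\n46,121\n29,\n"

df = pd.read_csv(io.StringIO(csv_text))

# The missing cell becomes NaN, which forces the whole column to float64.
print(df.dtypes["weight"])    # float64
print(df["weight"].tolist())  # [121.0, nan]

# `age` has no missing values, so it stays int64.
print(df.dtypes["age"])       # int64
```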
Expected:
Integer columns containing nulls should be handled gracefully instead of
crashing the workflow. Any one of the following would do:
- (a) the Python worker's schema validation should coerce an integral float
(e.g. `119.0`) back to INTEGER and NaN to null, or
- (b) CSV File Scan should infer a null-containing integer column as DOUBLE,
or
- (c) the UI should expose a per-column type override on CSV File Scan.
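
As a rough illustration of option (a), the coercion could look like the sketch below. `coerce_to_int` is a hypothetical helper, not an existing function in the Python worker, and where it would hook into `validate_schema` is left open.

```python
import math
import numbers
from typing import Any, Optional


def coerce_to_int(value: Any) -> Optional[int]:
    """Hypothetical helper: map a pandas-style value to an INTEGER field value.

    None/NaN -> None, integral floats -> int, anything else -> TypeError.
    """
    if value is None:
        return None
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError(f"expected an integer, got {value!r}")
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            return int(value)
    # Non-integral floats (e.g. 119.5), inf, strings, etc. are still errors.
    raise TypeError(f"expected an integer, got {value!r} ({type(value)})")
```

With this, `coerce_to_int(119.0)` gives `119` and `coerce_to_int(float("nan"))` gives `None`, while a value such as `119.5` still raises, so real type mismatches are not hidden.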
Current workaround: insert a Type Casting operator and manually cast every
affected integer column to `double`. This works but is manual and error-prone
(casting to `integer` instead of `double` silently reproduces the bug).
### How to reproduce?
1. Prepare a CSV with an integer-valued column that contains at least one
empty cell, e.g. diabetes.csv where `weight` is all integers except one blank.
2. Build workflow: CSV File Scan -> Sort. In Sort, sort by any column (e.g.
`age`).
3. Run the workflow.
4. The Sort operator fails on finish with:
```
TypeError: Unmatched type for field 'weight', expected AttributeType.INT,
got 119.0 (<class 'float'>) instead.
```
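
If diabetes.csv is not at hand, a small stand-in file for step 1 can be generated as sketched below. The file name `repro.csv` and the row values are made up for illustration; the only property that matters is an integer-valued column with one blank cell.

```python
import csv

# Hypothetical rows; only the blank `weight` cell matters for the repro.
rows = [
    {"age": 46, "weight": 121},
    {"age": 29, "weight": ""},  # empty cell -> missing value
    {"age": 58, "weight": 119},
]

with open("repro.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "weight"])
    writer.writeheader()
    writer.writerows(rows)
```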
Workaround that fixes it:
CSV File Scan -> Type Casting (cast weight/waist/hip/time.ppn -> double) ->
Sort, then re-run.
### Version/Branch
1.3.0-incubating-SNAPSHOT (main)