Hi everyone,

My name is Prathamesh Dhanashri, and I'd like to introduce myself as a new
contributor to Apache Wayang. I'm interested in contributing to the project
as part of Google Summer of Code (GSoC) and am looking to familiarize
myself with the codebase by working on open issues.

As a starting point, I've been investigating *issue #690* (
https://github.com/apache/wayang/issues/690) reported by zkaoudi regarding
a CSV parsing error when reading from the filesystem using the SQL API. The
error occurs at *JavaCSVTableSource.java line 127* where *tokens.length !=
fieldTypes.size()*.

*Root Cause:*
In *WayangTableScanVisitor.java (line 67)*, the fieldTypes list is built
from *wayangRelNode.getRowType()*, which returns the RelNode's row type. In
certain configurations, this row type may have fewer fields than the actual
table schema (e.g., when Calcite optimizes away unused columns). However,
the CSV source always reads all columns from disk, causing a mismatch
between tokens.length and fieldTypes.size().
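
To illustrate the mismatch, here is a minimal sketch in plain Java. It is not the actual Wayang code: the class name, the parseLine helper, the sample row, and the use of strings as stand-ins for RelDataType are all assumptions made for illustration. It only mimics the length check described above.

```java
import java.util.List;

// Hypothetical stand-in for the CSV source's length check; not Wayang code.
class CsvSchemaMismatchSketch {

    // Splits one CSV line and rejects it when the number of tokens differs
    // from the number of expected field types, as in the reported error.
    static String[] parseLine(String line, List<String> fieldTypes) {
        String[] tokens = line.split(",", -1);
        if (tokens.length != fieldTypes.size()) {
            throw new IllegalStateException(
                "Expected " + fieldTypes.size() + " fields but found " + tokens.length);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "1,Alice,30,Berlin"; // the file on disk always has 4 columns

        // Full table schema: the check passes.
        List<String> fullSchema = List.of("INTEGER", "VARCHAR", "INTEGER", "VARCHAR");
        System.out.println(parseLine(line, fullSchema).length);

        // Trimmed row type (1 field): the check fails, as in the issue.
        List<String> trimmedRowType = List.of("VARCHAR");
        try {
            parseLine(line, trimmedRowType);
        } catch (IllegalStateException e) {
            System.out.println("Mismatch: " + e.getMessage());
        }
    }
}
```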

*Proposed Fix:*
Change line 67 of *WayangTableScanVisitor.java* from:
*final List<RelDataType> fieldTypes =
wayangRelNode.getRowType().getFieldList().stream()*
to:
*final List<RelDataType> fieldTypes =
wayangRelNode.getTable().getRowType().getFieldList().stream()*
Using *getTable().getRowType()* always returns the full table schema,
consistent with how getColumnNames() already works in *WayangTableScan.java
(line 98)*. The downstream WayangProject operator handles column selection
separately via a MapOperator.

*Testing:*
I've written a regression test using Mockito that simulates a
WayangTableScan with a trimmed row type (1 field) while the table has 4
fields, reproducing the exact scenario described in the issue. The test
fails before the fix and passes after. All existing tests continue to pass.

I plan to open a PR with this fix and the regression test shortly.

Looking forward to your feedback!

Thanks,
Prathamesh Dhanashri
