[I] CSV reader: add a default column type (or sentinel mapping) to avoid per-column enumeration [arrow]

via GitHub Thu, 04 Sep 2025 08:53:52 -0700


dxdc opened a new issue, #47502:
URL: https://github.com/apache/arrow/issues/47502


   ### Describe the enhancement requested
   
   ## Motivation / use-case
   Many users need to coerce **all** columns of a CSV to the same Arrow 
type—most commonly `string()` to keep raw text—when the schema is unknown or 
very wide.  
   
   Today the API only permits either:
   
   * passing an explicit `column_types={"colA": pa.string(), …}` map, **or**
   * letting the reader infer per-column types.
   
   That forces callers to a) know every header in advance and b) enumerate 
them, which is painful for dynamic files.  
   The limitation was raised in 
[*ARROW-5811*](https://github.com/apache/arrow/issues/22232).
   
   The current documentation describes no built-in alternative to the explicit map.
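
Until such an option exists, one workaround is to read the header row first and build the explicit map from it. A minimal sketch using only the standard library; `header_names` and `uniform_column_types` are hypothetical helpers, and the string `"string"` is a stand-in for `pa.string()`:

```python
import csv
import io


def header_names(text_stream):
    """Return the column names found in the first row of a CSV text stream."""
    return next(csv.reader(text_stream))


def uniform_column_types(names, arrow_type):
    """Map every column name to the same type object."""
    return {name: arrow_type for name in names}


# In-memory sample so the sketch runs without touching the filesystem.
sample = io.StringIO("id,name,score\n1,alice,3.5\n")
names = header_names(sample)
types = uniform_column_types(names, "string")  # stand-in for pa.string()
```

With pyarrow, the resulting dict could be passed as `pcsv.ConvertOptions(column_types=types)` using `pa.string()` as the type. The cost is an extra pass over the header, and the sketch assumes a header row and the default delimiter.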
   
   ---
   
   ## Proposed change
   
   ### **Option A – sentinel entry in `column_types`**  
   Honor a magic key (e.g. `"*"`, `"__default__"`, or a constant 
`kWildcardColumn`) inside `ConvertOptions.column_types`.  
   Lookup order in `MakeConversionSchema()` becomes:
   
   1. exact match in `column_types`  
   2. sentinel key  
   3. current fallback (type inference)
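
The lookup order above can be modelled in a few lines. This is only an illustration of the cascade, not the Arrow implementation: `SENTINEL`, `resolve_type`, and the `infer` callback are hypothetical names, and types are represented as plain strings.

```python
SENTINEL = "*"  # hypothetical wildcard key; the issue also suggests "__default__"


def resolve_type(column_name, column_types, infer):
    """Pick a type for one column following the proposed three-step order."""
    # 1. exact match in column_types
    if column_name in column_types:
        return column_types[column_name]
    # 2. sentinel key, if the caller supplied one
    if SENTINEL in column_types:
        return column_types[SENTINEL]
    # 3. current fallback: type inference
    return infer(column_name)


def fake_infer(column_name):
    """Stand-in for the reader's type inference."""
    return "inferred"


with_sentinel = {"*": "string", "id": "int64"}
without_sentinel = {"id": "int64"}
```

One consequence of this design is that a CSV with a column literally named `*` cannot be distinguished from the sentinel, which is the kind of collision Option B avoids.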
   
   ### **Option B – new field `default_column_type`**  
   Add `std::shared_ptr<DataType> default_column_type = nullptr` to 
`ConvertOptions`.  
   If non-null, columns **not** listed in `column_types` are converted with 
that type.
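
A small model of how Option B might behave, assuming the semantics described above. `ConvertOptionsModel` and `conversion_schema` are hypothetical names used for illustration only; `None` in the result means "leave this column to inference".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConvertOptionsModel:
    """Toy stand-in for ConvertOptions with the proposed extra field."""
    column_types: Dict[str, str] = field(default_factory=dict)
    default_column_type: Optional[str] = None  # None keeps legacy behaviour


def conversion_schema(names, opts):
    """Return {column name: type}, with None marking columns left to inference."""
    schema = {}
    for name in names:
        if name in opts.column_types:
            schema[name] = opts.column_types[name]      # explicit entry wins
        elif opts.default_column_type is not None:
            schema[name] = opts.default_column_type     # proposed default
        else:
            schema[name] = None                         # infer as today
    return schema


cols = ["id", "name"]
default_only = conversion_schema(cols, ConvertOptionsModel(default_column_type="string"))
with_override = conversion_schema(
    cols, ConvertOptionsModel({"id": "int64"}, default_column_type="string"))
legacy = conversion_schema(cols, ConvertOptionsModel({"id": "int64"}))
```

The three results correspond to the three test scenarios listed later in this issue: default only, default plus explicit override, and no default.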
   
   Both approaches are backwards-compatible; Option B is explicit and avoids 
magic strings, while Option A is a one-line API addition.
   
   ---
   
   ## Python examples
   
   ```python
   import pyarrow as pa, pyarrow.csv as pcsv
   
   # Option A (sentinel): proposed behaviour; "*" has no special meaning today
   opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
   tbl = pcsv.read_csv("data.csv", convert_options=opts)
   
   # Option B (explicit field): proposed; default_column_type does not exist yet
   opts = pcsv.ConvertOptions(
       default_column_type=pa.string(),  # NEW
       column_types={"id": pa.int64()},  # explicit override
   )
   tbl = pcsv.read_csv("data.csv", convert_options=opts)
   ```
   
   ### Affected code (C++ path overview)
   
   | Layer | File(s) | Change summary | Notes |
   |-------|---------|----------------|-------|
   | **Public API** | `cpp/src/arrow/csv/options.h` | • **Add** `std::shared_ptr<DataType> default_column_type;` to `struct ConvertOptions` (Option B) **or** define `static const std::string kWildcardColumn = "__default__";` (Option A).<br>• Document the new knob in the Doxygen comment. | Keeps the setting user-visible. |
   | | `cpp/src/arrow/csv/options.cc` | • In `ConvertOptions::Defaults()`, initialise `opts.default_column_type = nullptr;`.<br>• Extend `ConvertOptions::Validate()` to raise `Status::Invalid` for an illegal dtype or duplicate sentinel. | Ensures default behaviour remains unchanged. |
   | **Core logic** | `cpp/src/arrow/csv/reader.cc`, inside `MakeConversionSchema()` | Replace the existing two-branch decision with a three-branch cascade:<br>  1. **explicit mapping** →<br>  2. **default_column_type / sentinel** →<br>  3. **infer type** (legacy path). | ~10 LOC patch; confined to one lambda. |
   | **Unit tests (C++)** | `cpp/src/arrow/csv/options_test.cc` (new) | Add three cases:<br>• default only: every column gets that type.<br>• default + explicit overrides: explicit wins.<br>• default == nullptr: legacy inference. | Guards against regressions. |
   | **Python binding** | `python/pyarrow/_csv.pyx` (Cython) | • **Expose** `default_column_type` keyword (accept `None` or `DataType`).<br>• Map to/from the underlying C++ field. | Maintains PyArrow feature parity. |
   | | `python/pyarrow/tests/test_csv.py` | Mirror the three C++ test scenarios. | Confirms binding wiring. |
   | **Documentation** | `docs/source/cpp/csv.rst`, `docs/source/python/csv.rst` | Add one bullet and a quick example for the new option. | Makes the feature discoverable. |
   | **Other bindings** (optional) | R, GLib, Rust wrappers | Add the field/property if those wrappers already expose `ConvertOptions`. | Can be staged separately. |
   
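   > **Build system:** No CMake or Meson changes should be needed for the library itself, because the dataset/file-CSV paths automatically inherit the updated `ConvertOptions`.
   > A new test source file would still need to be registered with the existing CSV test target.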
   
   ---
   
   ### Cross-language bindings checklist
   | Language | File / area | Binding note |
   |----------|-------------|--------------|
   | **Python (pyarrow)** | `_csv.pyx` | add `default_column_type` kwarg with `None` ⇒ `nullptr` |
   | **R** (`arrow::r::csv`) | `r/src/` | mirror the field in `convert_options()` constructor |
   | **GLib** | `glib/arrow-gio/csv-options.cpp` | expose property `default-column-type` |
   | **Rust** | `arrow-csv` crate | add `default_column_type: Option<DataType>` |
   | **Java / JNI** | none (CSV reader lives in C++ backend) | no change |
   
   These additions are mechanical once the C++ core is in place.
   
   ---
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
