dxdc opened a new issue, #47502: URL: https://github.com/apache/arrow/issues/47502
### Describe the enhancement requested

## Motivation / use-case

Many users need to coerce **all** columns of a CSV to the same Arrow type, most commonly `string()` to keep the raw text, when the schema is unknown or very wide. Today the API only permits either:

* passing an explicit `column_types={"colA": pa.string(), …}` map, **or**
* letting the reader infer per-column types.

That forces callers to (a) know every header in advance and (b) enumerate them, which is painful for dynamic files. The limitation was raised in [*ARROW-5811*](https://github.com/apache/arrow/issues/22232). The current docs confirm that no built-in way exists beyond the explicit map.

---

## Proposed change

### **Option A – sentinel entry in `column_types`**

Honor a magic key (e.g. `"*"`, `"__default__"`, or a constant `kWildcardColumn`) inside `ConvertOptions.column_types`. Lookup order in `MakeConversionSchema()` becomes:

1. exact match in `column_types`
2. sentinel key
3. current fallback (type inference)

### **Option B – new field `default_column_type`**

Add `std::shared_ptr<DataType> default_column_type = nullptr` to `ConvertOptions`. If non-null, columns **not** listed in `column_types` are converted with that type.

Both approaches are backwards-compatible; Option B is explicit and avoids magic strings, while Option A is a one-line API addition.
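The lookup order described above can be modelled in a few lines of plain Python. This is only a sketch of the proposed semantics, not Arrow code: `resolve_column_types` and `WILDCARD` are hypothetical names, type names are plain strings standing in for Arrow `DataType` objects, and `None` stands for "fall back to inference". Letting the explicit field win when both a sentinel and a default are supplied is an arbitrary choice made for this illustration.

```python
from typing import Dict, List, Optional

WILDCARD = "*"  # hypothetical sentinel key (Option A); not part of any released API


def resolve_column_types(
    column_names: List[str],
    column_types: Dict[str, str],
    default_column_type: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Return the target type per column; None means "infer the type"."""
    # Step 2 candidate: Option B's explicit field if set, else Option A's sentinel.
    fallback = (
        default_column_type
        if default_column_type is not None
        else column_types.get(WILDCARD)
    )
    resolved: Dict[str, Optional[str]] = {}
    for name in column_names:
        if name in column_types:
            resolved[name] = column_types[name]  # 1. exact match wins
        else:
            resolved[name] = fallback  # 2. default/sentinel, or 3. None -> infer
    return resolved


# Option A style: sentinel plus one explicit override
option_a = resolve_column_types(["id", "name"], {"*": "string", "id": "int64"})
# Option B style: separate default field plus one explicit override
option_b = resolve_column_types(["id", "name"], {"id": "int64"}, "string")
# Neither given: unlisted columns keep the legacy inference path
legacy = resolve_column_types(["id", "name"], {"id": "int64"})
```

In this model both options resolve `id` to `int64` and `name` to `string`, while the legacy call leaves `name` as `None` so that inference still applies.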
---

## Python examples

```python
import pyarrow as pa
import pyarrow.csv as pcsv

# Both snippets show the PROPOSED API; neither works with current releases.

# Option A (sentinel): "*" sets the default, "id" is an explicit override
opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
tbl = pcsv.read_csv("data.csv", convert_options=opts)

# Option B (explicit field)
opts = pcsv.ConvertOptions(
    default_column_type=pa.string(),   # NEW
    column_types={"id": pa.int64()},   # explicit override
)
tbl = pcsv.read_csv("data.csv", convert_options=opts)
```

### Affected code (C++ path overview)

| Layer | File(s) | Change summary | Notes |
|-------|---------|----------------|-------|
| **Public API** | `cpp/src/arrow/csv/options.h` | • **Add** `std::shared_ptr<DataType> default_column_type;` to `struct ConvertOptions` (Option B) **or** define `static const std::string kWildcardColumn = "__default__";` (Option A).<br>• Document the new knob in the Doxygen comment. | Keeps the setting user-visible. |
| | `cpp/src/arrow/csv/options.cc` | • In `ConvertOptions::Defaults()`, initialise `opts.default_column_type = nullptr;`.<br>• Extend `ConvertOptions::Validate()` to return `Status::Invalid` for an illegal type or a duplicate sentinel. | Ensures default behaviour remains unchanged. |
| **Core logic** | `cpp/src/arrow/csv/reader.cc`, inside `MakeConversionSchema()` | Replace the existing two-branch decision with a three-branch cascade:<br> 1. **explicit mapping** →<br> 2. **default_column_type / sentinel** →<br> 3. **infer type** (legacy path). | ~10 LOC patch; confined to one lambda. |
| **Unit tests (C++)** | `cpp/src/arrow/csv/options_test.cc` (new) | Add three cases:<br>• default only – every column gets that type.<br>• default + explicit overrides – explicit wins.<br>• default == nullptr – legacy inference. | Guards against regressions. |
| **Python binding** | `python/pyarrow/_csv.pyx` (Cython) | • **Expose** `default_column_type` keyword (accept `None` or `DataType`).<br>• Map to/from the underlying C++ field. | Maintains PyArrow feature parity. |
| | `python/pyarrow/tests/test_csv.py` | Mirror the three C++ test scenarios. | Confirms binding wiring. |
| **Documentation** | `docs/source/cpp/csv.rst`, `docs/source/python/csv.rst` | Add one bullet and a quick example for the new option. | Makes the feature discoverable. |
| **Other bindings** (optional) | R, GLib, Rust wrappers | Add the field/property if those wrappers already expose `ConvertOptions`. | Can be staged separately. |

> **Build system:** No CMake or Meson tweaks are required; the dataset/file-CSV paths automatically inherit the updated `ConvertOptions`.

---

### Cross-language bindings checklist

| Language | File / area | Binding note |
|----------|-------------|--------------|
| **Python (pyarrow)** | `_csv.pyx` | add `default_column_type` kwarg with `None` ⇒ `nullptr` |
| **R** (`arrow::r::csv`) | `r/src/` | mirror the field in `convert_options()` constructor |
| **GLib** | `glib/arrow-gio/csv-options.cpp` | expose property `default-column-type` |
| **Rust** | `arrow-csv` crate | add `default_column_type: Option<DataType>` |
| **Java / JNI** | none (CSV reader lives in C++ backend) | no change |

These additions are mechanical once the C++ core is in place.

---

### Component(s)

C++
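For comparison, the workaround that the motivation describes (enumerating every header by hand) can be approximated today by reading the header row first and building the explicit map from it. The following is a minimal sketch, assuming a comma-delimited input with a header row; `uniform_column_types` is a hypothetical helper, and the type names are plain strings standing in for values such as `pa.string()`. The resulting dict has the shape that `column_types` expects.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def uniform_column_types(
    stream: TextIO,
    default_type: str,
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Read only the header row and map every column to default_type."""
    header = next(csv.reader(stream), [])  # empty input -> no columns
    mapping = {name: default_type for name in header}
    # Apply explicit overrides, ignoring names that are not in the header.
    for name, type_name in (overrides or {}).items():
        if name in mapping:
            mapping[name] = type_name
    return mapping


sample = io.StringIO("id,name,city\n1,Ann,Oslo\n")
types = uniform_column_types(sample, "string", {"id": "int64", "missing": "bool"})
```

This costs an extra pass over the header before the real read and has to be repeated for every file, which is the overhead a built-in default type would remove.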