Re: [DISCUSS] Column-oriented indexing API for IndexWriter

neoremind Wed, 06 May 2026 04:04:53 -0700

Hi Tim,

Thanks for sharing the initial results; this is a cool new feature! Here
are my 2 cents.


Regarding the luceneutil benchmark, your observation is right: the nightly
bench indexing throughput on the Wikipedia dataset is heavily dominated by
inverted index building (tokenization, postings, etc.). Disabling the
inverted index path should give a much cleaner signal for just the
dv/points-oriented improvements you're targeting.

On the luceneutil benchmark overhead: I spent some time profiling the
document construction path in `LineFileDocs` recently (
https://github.com/mikemccand/luceneutil/pull/566). The parsing + document
object construction overhead is about 3% of total indexing time on a
single-threaded run, and a bit less if you use the binary file. Also, the
doc-reading thread runs separately from the indexer threads, so it
shouldn't starve Lucene. I'm not sure how you construct the dv fields in
your benchmark. If it has 150-1000 fields as you described in the PR, then
targeted profiling of both luceneutil and Lucene core would be needed. To
share what I did in practice: you can verify the overhead by commenting
out `w.addDocument()` in `IndexThreads` to isolate the doc-building cost
from the actual indexing work. An async-profiler flamegraph is also really
helpful: you can see exactly how much time goes to field construction vs.
the real indexing workload. In addition, I find that Lucene indexing is not
that I/O intensive in the nightly bench; most of the time is spent on CPU,
so the CPU-view flamegraph should explain enough.
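
To illustrate the "comment out the indexing call" trick without depending
on luceneutil, here is a minimal, self-contained Java sketch. It is only a
stand-in: `DocBuildCostSketch`, `buildDoc`, and `timeRun` are hypothetical
names, the tab-separated line format is an assumption, and the "indexer" is
a plain callback rather than a real `IndexWriter`. The idea is just to time
the same loop twice, once with a no-op consumer and once with real work, and
compare.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative only: stand-ins for the real benchmark classes; all names are hypothetical. */
final class DocBuildCostSketch {

  /** Stand-in for document construction: parse one tab-separated line into field values. */
  static List<String> buildDoc(String line) {
    return Arrays.asList(line.split("\t"));
  }

  /**
   * Builds a doc for every line and hands it to {@code indexer}; returns elapsed nanoseconds.
   * Passing a no-op indexer plays the role of commenting out the addDocument call.
   */
  static long timeRun(List<String> lines, Consumer<List<String>> indexer) {
    long start = System.nanoTime();
    for (String line : lines) {
      indexer.accept(buildDoc(line));
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    List<String> lines = Collections.nCopies(100_000, "title\t2026-05-06\tbody text");
    long[] sink = new long[1]; // keeps the fake "indexing" work observable

    // Run 1: doc building only (indexing call effectively commented out).
    long buildOnly = timeRun(lines, doc -> {});

    // Run 2: doc building plus a fake indexing step (an assumption, not real Lucene work).
    long buildAndIndex = timeRun(lines, doc -> {
      for (String field : doc) {
        sink[0] += field.hashCode();
      }
    });

    System.out.printf(
        "build only: %d ms, build + index: %d ms (sink=%d)%n",
        buildOnly / 1_000_000, buildAndIndex / 1_000_000, sink[0]);
  }
}
```

A single timed pass like this is noisy (JIT warm-up, GC), so treat it as a
rough first check; the flamegraph gives the more trustworthy breakdown.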

It's worth noting the difference between the two nightly benchmarks for
context (correct me if I am wrong): the "~1 KB Wikipedia English docs"
benchmark uses multi-threaded indexing with no dv and no facets, with CMS,
and does not wait on merge/commit, while the "Fixed (deterministic,
1-thread) search index build throughput" benchmark uses single-threaded
indexing with dv and facets (which are pretty heavy in
construction/parsing), no CMS, and waits on merge/commit. You could tweak
the nightly bench locally by bringing your own schema (I guess you have
already done so :)). In addition, you can check the per-run flamegraphs,
memory allocation, GC, and file I/O profiling on
https://blunders.io/lucene-bench for more insights.

I also like the column batch idea. Essentially, I think this is another
form of vectorized processing (as distinct from vector search or the Panama
Vector API). It reminds me of the Vectorwise / MonetDB X100 line of work:
the classic `Volcano` iterator model forces tuple-at-a-time execution,
which both lowers IPC and hides opportunities for CPU-level parallelism and
cache locality. Column-at-a-time execution addresses this with better CPU
pipelining, and the compiler can actually see the loop structure. Similar
to the query path, it makes total sense to have `ColumnBatch` work on the
indexing path as well.
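
To make the tuple-at-a-time vs. column-at-a-time contrast concrete, here is
a small, self-contained Java sketch. It does not use any Lucene API and is
not how the PR implements `ColumnBatch`; `RowVsColumnSketch`, `Row`, and the
two `sumBytes...` methods are hypothetical names chosen for illustration.
Both methods compute the same result; the difference is the shape of the
loop the JIT gets to see.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not Lucene code. All names here are hypothetical. */
final class RowVsColumnSketch {

  /** One "document" with two numeric fields, as a row-oriented API would hand it over. */
  static final class Row {
    final long timestamp;
    final long bytes;

    Row(long timestamp, long bytes) {
      this.timestamp = timestamp;
      this.bytes = bytes;
    }
  }

  /** Tuple-at-a-time: every iteration dereferences a Row object before reading one field. */
  static long sumBytesRowAtATime(List<Row> rows) {
    long total = 0;
    for (Row row : rows) {
      total += row.bytes;
    }
    return total;
  }

  /** Column-at-a-time: one tight loop over a primitive array that holds a single field. */
  static long sumBytesColumnAtATime(long[] bytesColumn) {
    long total = 0;
    for (long value : bytesColumn) {
      total += value;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(new Row(1L, 10L), new Row(2L, 20L), new Row(3L, 30L));
    long[] bytesColumn = {10L, 20L, 30L};
    // Both paths must agree; only the memory layout and loop shape differ.
    System.out.println(sumBytesRowAtATime(rows) == sumBytesColumnAtATime(bytesColumn));
  }
}
```

The column variant reads contiguous primitives with no per-element pointer
chasing, which is the kind of loop that tends to pipeline and auto-vectorize
well.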

Looking forward to the more detailed results, flamegraphs, and micro
benchmark numbers.

Best,
Neoremind

On Sun, May 3, 2026 at 10:18, Tim Brooks <[email protected]> wrote:

> I'll send some full results in a few days.
>
> I did a first pass and the improvements were pretty minor for the sparse
> column variant. After investigating a bit, I found that the benchmark
> still has a considerable number of inverted index fields, which drop back
> to row processing. When I switched all fields to docvalues the gains were
> in the 15-20% range. This is still much smaller than what we are seeing
> (2-4X from sparse to dense). I suspect it is because a considerable
> amount of the luceneutil indexing benchmarks is consumed by reading from
> the file, parsing date times, etc.
>
> I'll investigate a bit more and share the more specific results and the
> changes I made to the benchmark to surface the docvalue/points oriented
> improvements. I'll also share some details on what I'm seeing in my
> macrobenchmarks, where the gains are larger.
>
> --
>   Tim Brooks
>   [email protected]
>
> On Thu, Apr 30, 2026, at 1:47 PM, Adrien Grand wrote:
>
> Very cool! I remember wanting something like this when I was looking into
> making Lucene a bit better at ingesting small structured documents like
> IndexGeoNames (
> https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/extra/perf/IndexGeoNames.java).
> Is your POC complete enough to get a sense of the speedup that we'd get on
> this benchmark?
>
> On Tue, Apr 28, 2026 at 3:53 AM Tim Brooks <[email protected]> wrote:
>
> Hi all,
>
> I'd like to propose adding a column-oriented document-ingestion API to
> IndexWriter and get early feedback on the shape before opening a PR. I've
> been prototyping this on a branch and would like to understand community
> appetite before pushing further.
>
> https://github.com/apache/lucene/pull/15990
>
> ## The concept
>
> Today IndexWriter consumes an Iterable<IndexableField> per document: the
> indexing chain walks each field, re-resolves FieldInfo / PerField state,
> revalidates the field type against the schema, and interleaves
> stored-fields, postings, doc-values and points per document.
>
> The proposal is to add a parallel intake path:
> IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns,
> where each Column represents one field across all documents in the batch.
> The indexing chain then processes the batch in two passes:
>
> 1. A row-oriented pass for stored fields and the inverted index (per-doc
> processing still matters there).
> 2. A column-oriented pass for doc values, vectors, and points (where
> per-field bulk writes are a natural fit).
>
> Column itself is just metadata (name, IndexableFieldType, density).
> Iteration happens through typed cursors obtained from the subclasses:
> LongColumn for numeric DV, 1-D points, and numeric stored; BinaryColumn for
> binary/sorted DV, text/binary stored, and binary-encoded points; and
> VectorColumn for KNN vectors. Each cursor call returns a fresh cursor, so a
> column can be traversed once in the row pass and again in the column pass.
>
> ## Two benefits motivate this:
>
> 1. More compact in-memory representation during indexing. A column batch
> avoids the per-field allocations of the document-at-a-time path
> (IndexableField instances, per-doc FieldType references, per-doc attribute
> maps). For numeric DV and points in particular, the caller can hand us a
> primitive-backed cursor that the chain drains directly into
> PackedLongValues / the points writer without indirection.
> 2. Less redundant field validation. Field name, type, indexing options,
> and schema compatibility are resolved once per column instead of once per
> IndexableField. For workloads where a caller already knows the schema of a
> batch, that revalidation is pure overhead.
>
> All in all, these changes reduce the CPU usage dedicated to
> IndexWriter#addDocuments by 4-5x for analytics-heavy workloads.
>
> No changes to on-disk format; this is an ingestion-side API only.
>
> ## MVP: sparse columns
>
> The minimum useful version is sparse-only: every column is allowed to skip
> doc-ids or have multiple values per doc-id, and the chain goes through the
> same per-doc paths it uses today (just driven by a cursor instead of an
> IndexableField stream). This is enough to land the API, the two-pass
> consumer, and the public addBatch entry point without touching the
> doc-values / points writers.
>
> ## Follow-on option: dense columns
>
> The bigger performance wins come from advertising a column as dense —
> every doc in [0, numDocs) has exactly one value. That lets the chain:
>
> - Skip the sparse-bitset bookkeeping in NumericDocValuesWriter /
> SortedNumericDocValuesWriter entirely on the dense path.
> - Bulk-fill straight into PackedLongValues from the column's values()
> cursor, avoiding the per-value add loop.
> - For 1-D numeric points, feed the BKD writer from the same dense
> primitive cursor instead of one BytesRef at a time.
> - For n-D numeric points, a fixed size binary column could feed multiple
> document points in a single write. This is an expert scenario as users have
> to serialize the points properly in sort order in the column.
>
> Density is asserted by the column up-front so the chain can pick the path
> without probing.
>
> ## Follow-on option: Ergonomic builders
>
> I have focused on very low-level APIs (abstract long and byte columns
> implemented by users). Lucene could eventually add builders to make
> columns easier to create (similar to IntField, LongField, etc).
>
> ## Indexed-only terms ("DOCS + no norms") as a column
>
> One more case worth flagging: fields indexed with IndexOptions.DOCS and no
> norms — keyword/filter-style fields — don't need per-doc TokenStream
> plumbing. A BinaryColumn over such a field can feed the postings writer
> directly (one BytesRef per doc, no analysis, no norm accumulation). I have
> not implemented this in my POC.
>
> ## Scope of the initial proposal
>
> - New package org.apache.lucene.document.column with ColumnBatch, Column,
> LongColumn, BinaryColumn, and their cursors.
> - New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through
> DocumentsWriter / DocumentsWriterPerThread.
> - Indexing-chain changes to support the two-pass consumer.
> - All marked @lucene.experimental.
> - Try to implement as much of the column oriented processing in the column
> package to keep things experimental as long as possible.
>
> Would love feedback on whether this is something Lucene is interested in
> or would be open to. It would help significantly in the analytical case
> and remove significant indirection and memory usage amplification from
> the per-field allocations.
>
> Thanks,
> Tim
>
> --
>   Tim Brooks
>   [email protected]
>
>
>
>
> --
> Adrien
>
>
>

-- 

Best regards,

neoremind@
