TL;DR: How much do we care about our client APIs supporting multi-valued fields where each value is a "Vector" (aka: a "java.util.List" of numbers), when the alternative (for now) is to use a String representation of each value?

---

Once upon a time, I was working on some custom plugin work where I had a field type using "Tuples" (of strings) as values -- but the field type also needed to be multi-valued (ie: a variable number of tuples in a single document)

Writing my field type (as a composite around multiple abstracted away indexed & docValue fields) was pretty easy, but I ran into weirdness writing tests with indexing (with SolrInputDocument), and getting my "stored" values back (via SolrDocument) because I was trying to represent my "Tuples" in code as java.util.List instances (to play nice with Solr's existing serialization logic, response writers, etc).

So *each* field value was a List<String>

The problem came in my tests of multi-valued fields. Due to long-standing "convenience" logic, SolrInputDocument and SolrDocument assume that if you set/add a java.util.Collection as a field "value", the *real* field values are the contents of that Collection -- the class either re-uses the Collection as is, or adds *each* of its elements to its existing java.util.Collection for that field.
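
To make that flattening concrete, here is a minimal sketch using a toy stand-in class. ToyDocument and its methods are hypothetical and written only for this example; this is NOT the real SolrInputDocument code, it just mimics the "a Collection means many values" convention described above (SOLR-17974 has the details of the real behavior).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not part of SolrJ.
final class ToyDocument {
    private final Map<String, Collection<Object>> fields = new LinkedHashMap<>();

    // Mimics the "convenience" rule: a Collection passed as a value is
    // treated as a bag of values, so its elements are added one by one.
    void addField(String name, Object value) {
        Collection<Object> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (value instanceof Collection) {
            values.addAll((Collection<?>) value);
        } else {
            values.add(value);
        }
    }

    Collection<Object> getFieldValues(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        // Intent: two values, each one a 2-element "Tuple".
        doc.addField("tuples", List.of("a", "b"));
        doc.addField("tuples", List.of("c", "d"));
        // Result: four flat String values; the tuple boundaries are gone.
        System.out.println(doc.getFieldValues("tuples")); // prints [a, b, c, d]
    }
}
```

With this convention there is no way to tell "two tuples of two strings" apart from "four strings" once the values are in the document.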

(Since this email was getting really long & convoluted the first 3 times I tried to draft it, I created SOLR-17974 to go into all the details and attached a test case showing how weird it all is.)


Even though it was/is possible -- since I knew how the internals work -- to carefully set a field value in SolrInputDocument to be a List<List<String>>, I instead gave up and represented each "Tuple" as a String using a special delimiter -- making my "multi-valued" fields a List<String>.

(Which is/was similar to how Solr's spatial field types work, and plays nice with all ContentStream loaders and response writers)



I was reminded of all this about a year ago when I realized:

1) Solr's DenseVectorField can only be multiValued=false

2) For external representation purposes, DenseVectorField *acts* like a multi-valued numeric field (either List<Float> or List<Byte>)

3) There was/is work being considered in Lucene to add "multiple vectors per document" to the underlying HNSW graph logic (lucene/issues/12313)


All of which raised my eyebrow: "I wonder how Solr's going to deal with the List<List<Float>> problem if/when that happens?" ... but I had other things on my mind at the time.



Skip ahead to today...

Lucene still doesn't support multiple vectors per document in the HNSW graphs, but 10.3 *did* introduce a new LateInteractionField which is a different type of "vector" field where each document has a "float[][]" (a variable number of fixed sized "vectors", each represented as a fixed size float[]).

So the question becomes: How do we represent these document values in Solr if we want to add support for this new field type? (SOLR-17975)


The most expedient approach would be to follow in the footsteps of the spatial fields (and my old "Tuple" type) and use a String encoding -- either mapping a String<->float[] and being multiValued=true in schema, or mapping String<->float[][] and requiring multiValued=false (the "query" side of this field type will already need some way for users to express a "float[][]" in a query string)
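
As a rough illustration of the String-encoding option: a minimal sketch, assuming a comma between numbers and a '|' between vectors. The class name, method names, and delimiters are all made up for this example; nothing here is existing Solr or Lucene API, and a real field type might pick a different syntax entirely.

```java
import java.util.Arrays;
import java.util.StringJoiner;

// Hypothetical codec for illustration only; delimiters are an assumption.
final class VectorStringCodec {

    // String <-> float[]: one String per vector (the multiValued=true option).
    static String encode(float[] vector) {
        StringJoiner joiner = new StringJoiner(",");
        for (float f : vector) {
            joiner.add(Float.toString(f));
        }
        return joiner.toString();
    }

    // Assumes a non-empty, well-formed input; otherwise parseFloat throws.
    static float[] decode(String encoded) {
        String[] parts = encoded.split(",");
        float[] vector = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vector[i] = Float.parseFloat(parts[i]);
        }
        return vector;
    }

    // String <-> float[][]: all vectors in one String (the multiValued=false option).
    static String encodeAll(float[][] vectors) {
        StringJoiner joiner = new StringJoiner("|");
        for (float[] vector : vectors) {
            joiner.add(encode(vector));
        }
        return joiner.toString();
    }

    static float[][] decodeAll(String encoded) {
        String[] parts = encoded.split("\\|");
        float[][] vectors = new float[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            vectors[i] = decode(parts[i]);
        }
        return vectors;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1.0f, 2.0f}, {3.0f, 4.0f}};
        String encoded = encodeAll(vectors);
        System.out.println(encoded); // prints 1.0,2.0|3.0,4.0
        System.out.println(Arrays.deepEquals(vectors, decodeAll(encoded))); // prints true
    }
}
```

Either mapping keeps every document value a plain String, so the existing loaders and response writers never see a nested Collection.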


The alternative is to pay off our very old tech debt: SOLR-17974.

*IF* we redesign SolrInputDocument and SolrDocument to have more explicit APIs allowing for the possibility that a single "value" in a multi-valued field might be represented as a complex (possibly nested) java.util.Collection of primitives (recognizing that along the way, we will probably find lots of other places in the code base that make assumptions about multi-valued fields, just because it's always been that way) ... *THEN* ... we could model each vector value as a "List<Float>" and still have "multi-valued" vector fields.


But how much do people actually care about this?

Are there other use cases where this would be useful?

Is having a "cleaner" API worth the headaches of changing this now?

   ?






-Hoss
http://www.lucidworks.com/
