TL;DR: How much do we care about our client APIs supporting multi-valued
fields where each value is a "Vector" (aka: "java.util.List" of numbers),
when the alternative (for now) is to use a String representation of each
value?
---
Once upon a time, I was working on some custom plugin work where I had a
field type using "Tuples" (of strings) as values -- but the field type
also needed to be multi-valued (ie: a variable number of tuples in a
single document).
Writing my field type (as a composite around multiple abstracted away
indexed & docValue fields) was pretty easy, but I ran into weirdness
writing tests with indexing (with SolrInputDocument), and getting my
"stored" values back (via SolrDocument) because I was trying to represent
my "Tuples" in code as java.util.List instances (to play nice with Solr's
existing serialization logic, response writers, etc).
So *each* field value was a List<String>
The problem came in my tests of multi-valued fields -- due to long
standing "convenience" logic in SolrInputDocument and SolrDocument, those
classes assume that if you set/add a java.util.Collection as a field
"value", the *real* field values are the contents of that Collection --
they either re-use it as is, or add *each* of its elements to the
existing java.util.Collection for that field.
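To make that concrete, here is a rough toy model of the flattening.
ToyDoc is a made-up class for illustration only, NOT the real
SolrInputDocument, and it only mimics the "add each element" half of the
behavior described above:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued document; not a SolrJ class. */
final class ToyDoc {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    /** Mimics the "convenience" rule: a Collection is unpacked into separate values. */
    void addField(String name, Object value) {
        List<Object> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (value instanceof Collection<?>) {
            // The caller meant "one tuple", but every element becomes its own value.
            values.addAll((Collection<?>) value);
        } else {
            values.add(value);
        }
    }

    /** Returns the values stored for a field, or an empty list if there are none. */
    List<Object> getFieldValues(String name) {
        return fields.getOrDefault(name, List.of());
    }
}
```

Adding the two "tuples" List.of("a", "b") and List.of("c", "d") to the
same field in this model leaves four separate String values, with the
tuple boundaries lost.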
(Since this email was getting really long & convoluted the first 3 times I
tried to draft it, I created SOLR-17974 to go into all the details and
attached a test case showing how weird it all is.)
Even though it was/is possible -- since I knew how the internals work --
to carefully set a field value in SolrInputDocument to be a
List<List<String>>, I instead gave up and represented each "Tuple" as a
String using a special delimiter -- making my "multi-valued" fields a
List<String>.
(Which is/was similar to how Solr's spatial field types work, and plays
nice with all ContentStream loaders and response writers)
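A minimal sketch of that kind of delimiter encoding, assuming "|" as the
delimiter (a placeholder; the delimiter I actually used isn't important
here) and assuming tuple elements never contain it. TupleCodec is a
made-up name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: one "Tuple" of strings <-> one delimited String. */
final class TupleCodec {
    // Assumed delimiter, for illustration only.
    static final String DELIM = "|";

    /** Joins the tuple elements into a single String value. */
    static String encode(List<String> tuple) {
        return String.join(DELIM, tuple);
    }

    /** Splits a String value back into its tuple elements. */
    static List<String> decode(String encoded) {
        // Pattern.quote because "|" is a regex metacharacter;
        // limit -1 keeps trailing empty elements.
        return Arrays.asList(encoded.split(Pattern.quote(DELIM), -1));
    }
}
```

With this, a multi-valued field is just a List<String> of encoded tuples,
which the existing Collection handling copes with fine.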
I was reminded of all this about a year ago when I realized:
1) Solr's DenseVectorField can only be multiValued=false
2) For external representation purposes, DenseVectorField *acts* like a
multi-valued numeric field (either List<Float> or List<Byte>)
3) There was/is work being considered in Lucene to add "multiple vectors
per document" to the underlying HNSW graph logic (lucene/issues/12313)
All of which raised my eyebrow: "I wonder how Solr's going to deal with
the List<List<Float>> problem if/when that happens?" ... but I had other
things on my mind at the time.
Skip ahead to today...
Lucene still doesn't support multiple vectors per document in the HNSW
graphs, but 10.3 *did* introduce a new LateInteractionField which is a
different type of "vector" field where each document has a "float[][]" (a
variable number of fixed sized "vectors", each represented as a fixed size
float[]).
So the question becomes: How do we represent these document values in Solr
if we want to add support for this new field type? (SOLR-17975)
The most expedient approach would be to follow in the footsteps of the
spatial fields (and my old "Tuple" type) and use a String encoding --
either mapping a String<->float[] and being multiValued=true in schema, or
mapping String<->float[][] and requiring multiValued=false (the "query"
side of this field type will already need some way for users to express a
"float[][]" in a query string)
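For the sake of discussion, here is one possible String<->float[][]
mapping, using a bracketed syntax like "[[1.0,2.0],[3.0,4.5]]". The
syntax and the VectorStrings class are assumptions made up for this
sketch, not something Solr or Lucene defines, and it does not check that
all vectors have the same size:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical String <-> float[][] mapping using an assumed bracket syntax. */
final class VectorStrings {

    /** Formats the vectors as "[[a,b],[c,d]]". */
    static String encode(float[][] vectors) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vectors.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('[');
            for (int j = 0; j < vectors[i].length; j++) {
                if (j > 0) sb.append(',');
                sb.append(vectors[i][j]);
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }

    /** Parses the syntax produced by encode(); throws on malformed input. */
    static float[][] decode(String s) {
        String t = s.trim();
        if (t.length() < 2 || t.charAt(0) != '[' || t.charAt(t.length() - 1) != ']') {
            throw new IllegalArgumentException("expected [[...],[...]]: " + s);
        }
        String inner = t.substring(1, t.length() - 1);
        List<float[]> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = inner.indexOf('[', pos);
            if (open < 0) break; // no more vectors
            int close = inner.indexOf(']', open);
            if (close < 0) throw new IllegalArgumentException("unbalanced brackets: " + s);
            String[] parts = inner.substring(open + 1, close).split(",");
            float[] vec = new float[parts.length];
            for (int i = 0; i < parts.length; i++) {
                vec[i] = Float.parseFloat(parts[i].trim());
            }
            out.add(vec);
            pos = close + 1;
        }
        return out.toArray(new float[0][]);
    }
}
```

The String<->float[] variant (with multiValued=true in the schema) would
be the inner loop of the same idea, one encoded String per vector.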
The alternative is to pay off our very old tech debt: SOLR-17974.
*IF* we redesign SolrInputDocument and SolrDocument to have more explicit
APIs allowing for the possibility that a single "value" in a multi-valued
field might be represented as a complex (possibly nested)
java.util.Collection of primitives (recognizing that along the way, we
will probably find lots of other places in the code base that make
assumptions about multi-valued fields, just because it's always been that
way.) ... *THEN* ... we could model each vector value as a "List<Float>"
and still have "multi-valued" vector fields.
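Purely as a strawman, "more explicit APIs" might look something like the
toy below. ExplicitDoc, addValue and addValues are invented names for
illustration; nothing like this exists in SolrJ today and this is not a
concrete proposal:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical document where the caller says explicitly what one "value" is. */
final class ExplicitDoc {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    /** Adds exactly one value, even if that value is itself a Collection. */
    void addValue(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Adds each element of the Collection as a separate value. */
    void addValues(String name, Collection<?> values) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).addAll(values);
    }

    /** Returns the values stored for a field, or an empty list if there are none. */
    List<Object> getValues(String name) {
        return fields.getOrDefault(name, List.of());
    }
}
```

Calling addValue twice with a List<Float> each time keeps two distinct
vector values in this model, instead of one flattened list of floats.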
But how much do people actually care about this?
Are there other use cases where this would be useful?
Is having a "cleaner" API worth the headaches of changing this now?
-Hoss
http://www.lucidworks.com/