lwhite1 commented on code in PR #14213:
URL: https://github.com/apache/arrow/pull/14213#discussion_r979080350


##########
docs/source/java/vector.rst:
##########
@@ -268,6 +268,82 @@ For example, the code below shows how to build a 
:class:`ListVector` of int's us
      }
   }
 
+Dictionary Encoding
+===================
+
+A :class:`FieldVector` can be dictionary encoded for performance or improved 
memory efficiency. While this is most often done with :class:`VarCharVector`, 
nearly any type of vector might be encoded if there are many values, but few 
unique values.
+
+There are a few steps involved in the encoding process:
+
+1. Create a regular, un-encoded vector and populate it.
+2. Create a dictionary vector of the same type as the un-encoded vector. This 
vector must have the same values, but each unique value in the un-encoded 
vector need appear here only once.
+3. Create a :class:`Dictionary`. It will contain the dictionary vector, plus a 
:class:`DictionaryEncoding` object that holds the encoding's metadata and 
settings values.
+4. Create a :class:`DictionaryEncoder`.
+5. Call the encode() method on the :class:`DictionaryEncoder` to produce an 
encoded version of the original vector.
+6. (Optional) Call the decode() method on the encoder, passing it the encoded 
vector, to re-create the original values.
+
+The encoded values will be integers. Depending on how many unique values you 
have, you can use either TinyIntVector, SmallIntVector, or IntVector to hold 
them. You specify the type when you create your :class:`DictionaryEncoding` 
instance. You might wonder where those integers come from: the dictionary 
vector is a regular vector, so the value's index position in that vector is 
used as its encoded value.
+
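To make the idea of using the index position as the encoded value concrete, 
here is a minimal plain-Java sketch that does not use Arrow at all. The class 
``IndexEncodingSketch`` and its ``encode`` and ``decode`` methods are 
hypothetical names chosen for illustration; they only mimic, on ordinary lists, 
the role the dictionary vector plays in the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Arrow classes.
final class IndexEncodingSketch {

    // Replace each value with its index position in the dictionary.
    static int[] encode(List<String> values, List<String> dictionary) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < dictionary.size(); i++) {
            positions.putIfAbsent(dictionary.get(i), i);
        }
        int[] encoded = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            Integer index = positions.get(values.get(i));
            if (index == null) {
                throw new IllegalArgumentException(
                    "value not in dictionary: " + values.get(i));
            }
            encoded[i] = index;
        }
        return encoded;
    }

    // Look each index up in the dictionary to recover the original values.
    static List<String> decode(int[] encoded, List<String> dictionary) {
        List<String> decoded = new ArrayList<>(encoded.length);
        for (int index : encoded) {
            decoded.add(dictionary.get(index));
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("apple", "banana", "cherry");
        List<String> values = List.of("banana", "apple", "banana", "cherry");
        int[] encoded = encode(values, dictionary);
        System.out.println(Arrays.toString(encoded));    // prints [1, 0, 1, 2]
        System.out.println(decode(encoded, dictionary)); // prints [banana, apple, banana, cherry]
    }
}
```

In this sketch a value that is missing from the dictionary cannot be encoded, 
which is why the dictionary has to contain every unique value found in the 
un-encoded data.
+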
+Another critical attribute in :class:`DictionaryEncoding` is the id. It's 
important to understand how the id is used, so we cover that later in this 
section.
+
+The result will be a new vector (for example, an IntVector) that can act in 
place of the original vector (for example, a VarCharVector). When you write the 
data in Arrow format, both the new IntVector and the dictionary are written: 
you will need the dictionary later to retrieve the original values.
+
+.. code-block:: Java
+
+    // 1. create a vector for the un-encoded data and populate it
+    VarCharVector unencoded = new VarCharVector("unencoded", allocator);
+    // now put some data in it before continuing
+
+    // 2. create a vector to hold the dictionary and populate it
+    VarCharVector dictionaryVector = new VarCharVector("dictionary", allocator);
+    // now add each unique value from the un-encoded data to it exactly once
+
+    // 3. create a dictionary object
+    Dictionary dictionary =
+        new Dictionary(dictionaryVector, new DictionaryEncoding(1L, false, null));
+
+    // 4. create a dictionary encoder
+    DictionaryEncoder encoder = new DictionaryEncoder(dictionary, allocator);
+
+    // 5. encode the data
+    IntVector encoded = (IntVector) encoder.encode(unencoded);
+
+    // 6. re-create an un-encoded version from the encoded vector
+    VarCharVector decoded = (VarCharVector) encoder.decode(encoded);
+
+One thing we haven't discussed is how to create the dictionary vector from the 
original un-encoded values. That is left to the library user since a custom 
method will likely be more efficient than a general utility.
+
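As one possible approach, assuming the un-encoded values fit in memory as a 
Java list, the unique values can be collected in first-seen order with a 
``LinkedHashSet``. The sketch below is plain Java; ``DictionaryBuilderSketch`` 
and ``uniqueValues`` are hypothetical names, not Arrow APIs, and copying the 
result into the dictionary vector is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java illustration only; not part of the Arrow library.
final class DictionaryBuilderSketch {

    // Keep the first occurrence of each value, in the order it was seen.
    static List<String> uniqueValues(List<String> unencoded) {
        return new ArrayList<>(new LinkedHashSet<>(unencoded));
    }

    public static void main(String[] args) {
        List<String> unencoded = List.of("b", "a", "b", "c", "a");
        System.out.println(uniqueValues(unencoded)); // prints [b, a, c]
    }
}
```

Preserving first-seen order is a design choice of this sketch, not a 
requirement: any order works as long as the same dictionary is used for both 
encoding and decoding.
+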
+Finally, you can package a number of dictionaries together, which is useful if 
you're working with a :class:`VectorSchemaRoot` with several dictionary-encoded 
vectors. This is done using an object called a :class:`DictionaryProvider`, as 
shown in the example below. Note that we don't put the dictionary vectors in 
the same :class:`VectorSchemaRoot` as the data vectors, as they will generally 
have fewer values.
+
+
+.. code-block:: Java
+
+    DictionaryProvider.MapDictionaryProvider provider =
+        new DictionaryProvider.MapDictionaryProvider();
+
+    provider.put(dictionary);
+
+The :class:`DictionaryProvider` is simply a map of identifiers to 
:class:`Dictionary` objects, where each identifier is a long value. In the 
code above, the identifier is the first argument to the 
:class:`DictionaryEncoding` constructor.
+
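The map-of-identifiers idea can be sketched in plain Java, without Arrow. 
``ProviderSketch`` below is a hypothetical stand-in written only for 
illustration; its ``put``, ``lookup`` and ``decode`` methods are assumptions of 
this sketch and do not reproduce the signatures of the real provider. It keeps 
two dictionaries under different ids so that two encoded columns can each find 
their own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; ProviderSketch is not an Arrow class.
final class ProviderSketch {

    // Each dictionary is stored under the long id that its encoding carries.
    private final Map<Long, List<String>> dictionaries = new HashMap<>();

    void put(long id, List<String> dictionary) {
        dictionaries.put(id, dictionary);
    }

    List<String> lookup(long id) {
        return dictionaries.get(id);
    }

    // Decode a column of indices with the dictionary registered under the id.
    List<String> decode(long id, int[] encoded) {
        List<String> dictionary = lookup(id);
        if (dictionary == null) {
            throw new IllegalArgumentException("no dictionary with id " + id);
        }
        List<String> decoded = new ArrayList<>(encoded.length);
        for (int index : encoded) {
            decoded.add(dictionary.get(index));
        }
        return decoded;
    }

    public static void main(String[] args) {
        ProviderSketch provider = new ProviderSketch();
        provider.put(1L, List.of("red", "green"));
        provider.put(2L, List.of("small", "large"));
        System.out.println(provider.decode(1L, new int[] {1, 0, 1})); // prints [green, red, green]
        System.out.println(provider.decode(2L, new int[] {0, 1}));    // prints [small, large]
    }
}
```

Because each column only records an id, one provider can serve any number of 
encoded columns, and each column still resolves to exactly one dictionary.
+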
+This is where the :class:`DictionaryEncoding`'s 'id' attribute comes in. This 
value is used to connect dictionaries to instances of 
:class:`VectorSchemaRoot`, using a :class:`DictionaryProvider`.  Here's how 
that works:
+
+* The :class:`VectorSchemaRoot` has a :class:`Schema` object containing a list 
of :class:`Field` objects.
+* Each field has an attribute called 'dictionary', but it holds a 
:class:`DictionaryEncoding` rather than a :class:`Dictionary`.
+* As mentioned, the :class:`DictionaryProvider` holds dictionaries indexed by 
a long value. This value is the id from your :class:`DictionaryEncoding`.
+* To retrieve the dictionary for a vector in a :class:`VectorSchemaRoot`, you 
get the field associated with the vector, get its dictionary attribute, and use 
that object's id to look up the correct dictionary in the provider.
+
+.. code-block:: Java
+
+    // create the encoded vector, the Dictionary and DictionaryProvider as discussed above
+
+    // Create a VectorSchemaRoot with one encoded vector
+    VectorSchemaRoot vsr = new VectorSchemaRoot(List.of(encoded));
+
+    // now we want to decode our vector, so we retrieve its dictionary from the provider
+    Field f = vsr.getSchema().findField(encoded.getName());
+    DictionaryEncoding encoding = f.getDictionary();
+    Dictionary dictionary = provider.lookup(encoding.getId());
+
+As you can see, a :class:`DictionaryProvider` is handy for managing the 
dictionaries associated with a :class:`VectorSchemaRoot`. More importantly, it 
helps package the dictionaries for a :class:`VectorSchemaRoot` when it's 
written. The classes :class:`ArrowFileWriter` and :class:`ArrowStreamWriter` 
both accept an optional :class:`DictionaryProvider` argument for that purpose. 
You can find example code for writing dictionaries in the documentation for 
:ref:`ipc`.

Review Comment:
   Sorry, I wasn't suggesting they were linked directly to a vector (and I was 
wrong about 1-1), but I believe a vector can be encoded by at most one 
dictionary. So my question is: if a Reader reads one vector, why does it need a 
relationship to a Map of dictionaries rather than a single dictionary?



