[ https://issues.apache.org/jira/browse/ARROW-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623229#comment-17623229 ]
David Li commented on ARROW-18090:
----------------------------------
I'm not familiar with the Rust APIs, but in Python/C++ it's pretty
straightforward: use a list type whose item type is dictionary-encoded.
{code:python}
>>> import pyarrow as pa
>>> ty = pa.list_(pa.dictionary(pa.int16(), pa.string()))
>>> ty
ListType(list<item: dictionary<values=string, indices=int16, ordered=0>>)
>>> pa.array([["tag1", "tag2"], ["tag1", "tag3"]], ty)
<pyarrow.lib.ListArray object at 0x7fc4d89ca940>
[
  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      1
    ],
  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      2
    ]
]
{code}
> Dictionary Style array for Keywords or Tags
> --------------------------------------------
>
> Key: ARROW-18090
> URL: https://issues.apache.org/jira/browse/ARROW-18090
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Sven Cattell
> Priority: Major
>
> I want to efficiently encode lists of tags for each element in my database.
> In my case I have 30 tags, and a few are assigned to each of my ~20m records.
> Here's a simplified example of 5 records:
> * pe, keylogger, cryptojack
> * pe, packed
> * pe, cryptojack, c2
> * pe, keylogger, c2
> * pe
> Right now I have to store these in a List<Utf8> and have huge amounts of
> duplicate data. The dictionary array looks almost perfect for this task. I
> just want to allow for a List<T> instead of just T for the allowed primitive
> index type in a dictionary.
>