[
https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alessandro Molina updated ARROW-4753:
-------------------------------------
Fix Version/s: 6.0.0 (was: 5.0.0)
> [C++] Extension types and layouts for text-optimized data structures
> --------------------------------------------------------------------
>
> Key: ARROW-4753
> URL: https://issues.apache.org/jira/browse/ARROW-4753
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++, Format
> Environment: C/C++
> Reporter: Edmon Begoli
> Priority: Minor
> Labels: features
> Fix For: 6.0.0
>
>
> Narrative text is, by default, notoriously inefficient to store on disk or
> in memory. In its most basic form, it is a long sequence of bytes with no
> indexing or other optimized layout structure.
>
> There are data structures such as
> [tries|https://en.wikipedia.org/wiki/Trie],
> [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton],
> or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more
> efficient storage and lookup of phrases.
>
> We would like to enable Arrow to serialize to and from these efficient
> structures, so that it can act as the format/carrier between high-performance
> text-processing steps that operate on binary data structures (lookups,
> spellers, or more advanced NLP routines).
>
> So, it could be something like:
>
> *{color:#707070}_text.to_arrow(infer=true|dafsa|trie|b-trie) :
> arrow_{color}* {color:#14892c}// writes arrow as format for the specified
> encoding. This could be implicit if we could store encoding in some kind of
> manifest{color}
>
> *{color:#707070}_arrow.to_text(infer=true|dafsa|trie|b-trie) :
> string_{color}* {color:#14892c}// restores text from the arrow format, and
> from a specified encoding, same as above. {color}
>
> {color:#333333}On the dev mailing list we are discussing the creation of a
> contrib folder where such features could be optionally included for
> Arrow.{color}
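
As a rough illustration of the kind of layout the description above has in mind, a trie can be flattened into parallel per-node columns, which is the general shape an Arrow extension type could wrap. The sketch below is not part of Arrow and does not use the Arrow C++ API. The name FlatTrie, its column names, and the first-child/next-sibling encoding are all assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flat trie: every column holds one entry per node, so the
// whole structure is a handful of contiguous buffers (columnar-friendly).
struct FlatTrie {
    std::vector<char> label;            // byte on the edge leading into the node
    std::vector<int32_t> first_child;   // index of the first child, -1 if none
    std::vector<int32_t> next_sibling;  // index of the next sibling, -1 if none
    std::vector<uint8_t> terminal;      // 1 if a stored word ends at this node

    FlatTrie() { add_node('\0'); }      // node 0 is the root

    int32_t add_node(char c) {
        label.push_back(c);
        first_child.push_back(-1);
        next_sibling.push_back(-1);
        terminal.push_back(0);
        return static_cast<int32_t>(label.size()) - 1;
    }

    // Return the child of `node` whose edge is labelled `c`, or -1.
    int32_t find_child(int32_t node, char c) const {
        assert(node >= 0 && node < static_cast<int32_t>(label.size()));
        for (int32_t k = first_child[node]; k != -1; k = next_sibling[k]) {
            if (label[k] == c) return k;
        }
        return -1;
    }

    // Encode: add one word, sharing any prefix that is already stored.
    void insert(const std::string& word) {
        int32_t node = 0;
        for (char c : word) {
            int32_t child = find_child(node, c);
            if (child == -1) {
                child = add_node(c);
                next_sibling[child] = first_child[node];  // prepend to child list
                first_child[node] = child;
            }
            node = child;
        }
        terminal[node] = 1;
    }

    // Lookup without materializing any strings.
    bool contains(const std::string& word) const {
        int32_t node = 0;
        for (char c : word) {
            node = find_child(node, c);
            if (node == -1) return false;
        }
        return terminal[node] == 1;
    }

    // Decode: depth-first walk that rebuilds every stored word.
    void collect(int32_t node, std::string& prefix,
                 std::vector<std::string>& out) const {
        if (terminal[node]) out.push_back(prefix);
        for (int32_t k = first_child[node]; k != -1; k = next_sibling[k]) {
            prefix.push_back(label[k]);
            collect(k, prefix, out);
            prefix.pop_back();
        }
    }

    std::vector<std::string> words() const {
        std::vector<std::string> out;
        std::string prefix;
        collect(0, prefix, out);
        return out;
    }
};
```

In this sketch, insert plays the role of the proposed to_arrow direction and words the role of to_text. Storing "car", "cart", and "cat" takes six nodes (including the root) rather than ten bytes of raw text plus offsets, and the savings grow as more words share prefixes. A DAFSA would additionally merge shared suffixes, which this sketch does not attempt.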
--
This message was sent by Atlassian Jira
(v8.3.4#803005)