[ 
https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820195#comment-16820195
 ] 

Antoine Pitrou commented on ARROW-4753:
---------------------------------------

Honestly I think this might be better as a separate project or library, where 
domain experts can freely devise and iterate on the best algorithms and data 
structures. I'm skeptical about standardizing layouts for Tries or other 
specialized structures at the Arrow project level, especially since it's likely 
there are many variants to choose from.

[~wesmckinn]

> Support optionally, and as an extension, an encoding layout for 
> text-optimized data structures
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4753
>                 URL: https://issues.apache.org/jira/browse/ARROW-4753
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>         Environment: C/C++
>            Reporter: Edmon Begoli
>            Priority: Minor
>              Labels: features
>
> Narrative (text), by default, is notoriously inefficient to store on disk 
> or in memory. In its most basic form, it is a long sequence of bytes with no 
> indexing or other optimized layout structure. 
>   
>  There are data structures such as 
> [tries|https://en.wikipedia.org/wiki/Trie], 
> [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton],
>  or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more 
> efficient storage and lookup of phrases. 
>   
>  We would like to enable Arrow to serialize to and from these efficient 
> structures, serving as the format/carrier between high-performance text 
> processing steps that operate on binary data structures (lookups, spellers, 
> or more advanced NLP routines).
>   
>  So, it could be something like:
>   
>  *_text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow_* // writes arrow as 
> format for the specified encoding. This could be implicit if we could store 
> encoding in some kind of manifest
>   
>  *_arrow.to_text(infer=true|dafsa|trie|b-trie) : string_* // restores text 
> from the arrow format, and from a specified encoding, same as above.
>   
>  On the dev mailing list we are discussing the creation of a contrib folder 
> where such features could be optionally included for Arrow.
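
As a rough illustration of the round trip requested in the quoted description, here is a minimal sketch of flattening a trie into parallel columns and restoring the words from them. Everything in it is an assumption made for illustration: the function names `to_columns` and `to_words` and the column names `parent`, `label`, and `terminal` are hypothetical, and the layout is not an Arrow format or anything proposed in the issue. It uses plain Python lists; in principle each list could be wrapped in an Arrow array.

```python
from typing import Dict, Iterable, List


def to_columns(words: Iterable[str]) -> Dict[str, list]:
    """Flatten a trie of `words` into three parallel columns (hypothetical layout).

    parent[i]   -- index of node i's parent (-1 for the root, which is node 0)
    label[i]    -- character on the edge from the parent to node i ("" for root)
    terminal[i] -- True if some word ends at node i
    """
    parent: List[int] = [-1]
    label: List[str] = [""]
    terminal: List[bool] = [False]
    # Build-time helper only: per-node map from character to child index.
    children: List[Dict[str, int]] = [{}]

    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # Shared prefixes reuse nodes; only unseen edges add a row.
                nxt = len(parent)
                parent.append(node)
                label.append(ch)
                terminal.append(False)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        terminal[node] = True

    return {"parent": parent, "label": label, "terminal": terminal}


def to_words(columns: Dict[str, list]) -> List[str]:
    """Restore the sorted set of words by walking parent links from terminals."""
    words = []
    for i, is_end in enumerate(columns["terminal"]):
        if not is_end:
            continue
        chars = []
        node = i
        while node != 0:
            chars.append(columns["label"][node])
            node = columns["parent"][node]
        words.append("".join(reversed(chars)))
    return sorted(words)


cols = to_columns(["car", "cart", "cat", "dog"])
restored = to_words(cols)
```

Note that this sketch restores the set of distinct words, not the original running text: word order and duplicates are lost, so a real text encoding would need to store more than the trie itself.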



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
