[ 
https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edmon Begoli updated ARROW-4753:
--------------------------------
    Description: 
Narrative text is, by default, notoriously inefficient to store on disk or in 
memory. In its most basic form, it is a long sequence of bytes with no 
indexing or other optimized layout structure. 
  
 There are data structures such as [tries|https://en.wikipedia.org/wiki/Trie], 
[DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton],
 or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more 
efficient storage and lookup of phrases. 
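
 To illustrate the kind of lookup these structures make cheap, here is a 
minimal sketch of a character trie in Python. It is not part of Arrow and not 
a proposed API; the class and method names (Trie, insert, contains, 
with_prefix) are hypothetical and chosen only for this example.

```python
class Trie:
    """Minimal character trie; an illustrative sketch, not an Arrow API."""

    def __init__(self):
        self.children = {}     # maps a character to the child Trie node
        self.terminal = False  # True if a stored phrase ends at this node

    def insert(self, phrase):
        node = self
        for ch in phrase:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, phrase):
        node = self._walk(phrase)
        return node is not None and node.terminal

    def with_prefix(self, prefix):
        """Return all stored phrases that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.terminal:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)

    def _walk(self, text):
        # Follow text one character at a time; None if the path is missing.
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node


trie = Trie()
for word in ["arrow", "array", "arrays", "trie"]:
    trie.insert(word)
```

 Shared prefixes such as "arr" are stored once, and a membership or prefix 
query costs time proportional to the length of the query rather than to the 
size of the stored text.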
  
 We would like to enable Arrow to serialize from/to these efficient structures 
as the format/carrier between high-performance text processing steps that 
operate on binary data structures (lookups, spellers, or more advanced NLP 
routines).
  
 So, it could be something like:
  
 *{color:#707070}_text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow_{color}* 
{color:#14892c}// writes Arrow as the format for the specified encoding. This 
could be implicit if we could store the encoding in some kind of manifest{color}
  
 *{color:#707070}_arrow.to_text(infer=true|dafsa|trie|b-trie) : string_{color}* 
{color:#14892c}// restores text from the Arrow format and from a specified 
encoding, same as above.{color}
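
 As a rough sketch of what such a round trip might look like for the trie 
encoding, the Python below flattens a trie into three parallel columns and 
restores the words from them. Everything here is an assumption made for 
illustration: the function names to_columns and to_words, the column names, 
and the use of plain Python lists as stand-ins for Arrow arrays. It restores 
the set of stored words, not the original text order or duplicates.

```python
def to_columns(words):
    """Flatten a trie built from words into three parallel columns.

    Row 0 is the root. Every other row is one trie node: parent[i] is the
    row of its parent, char[i] is the edge label leading to it, and
    terminal[i] says whether a word ends at that node.
    """
    parent, char, terminal = [-1], [""], [False]
    index = {}  # (parent_row, character) -> row of the child node
    for word in words:
        row = 0
        for ch in word:
            key = (row, ch)
            if key not in index:
                index[key] = len(parent)
                parent.append(row)
                char.append(ch)
                terminal.append(False)
            row = index[key]
        terminal[row] = True
    return {"parent": parent, "char": char, "terminal": terminal}


def to_words(columns):
    """Restore the sorted list of distinct words from the columns."""
    parent = columns["parent"]
    char = columns["char"]
    terminal = columns["terminal"]
    words = []
    for row in range(len(parent)):
        if terminal[row]:
            # Walk parent links back to the root, collecting edge labels.
            chars = []
            node = row
            while node != 0:
                chars.append(char[node])
                node = parent[node]
            words.append("".join(reversed(chars)))
    return sorted(words)


cols = to_columns(["tea", "ten", "to"])
restored = to_words(cols)
```

 The three columns have equal length and fixed-width or single-character 
values, which is the shape a columnar format handles well; how the chosen 
encoding would be recorded (the manifest mentioned above) is left open here.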
  
 {color:#333333}On the dev mailing list we are discussing the creation of a 
contrib folder where such features could be optionally included for 
Arrow.{color}

  was:
Narrative (text), by default, is notoriously inefficient to store on the disk 
or in memory. It is, in the most basic form, a long sequence of bytes with no 
indexing or other optimized layout structure. 
 
There are data structures such as [tries|https://en.wikipedia.org/wiki/Trie], 
[DAFSAs|]https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
 or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more 
efficient storage and lookup of phrases. 
 
We would like to enable arrow to serialize from/to these efficient structures 
as the format/carrier between high performance text processing steps which like 
to operate on binary data structures (lookups, spellers, or more advance NLP 
routines).
 
 
so, it could be something like:
 
*{color:#707070}_text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow_{color}* 
{color:#14892c}// writes arrow as format for the specified encoding. This could 
be implicit if we could store encoding in some kind of manifest{color}
 
*{color:#707070}_arrow.to_text(infer=true|dafsa|trie|b-trie) : string_{color}* 
{color:#14892c}// restores text from the arrow format, and from a specified 
encoding, same as above. {color}
 
{color:#333333}On the dev mailing list we are discussion creation of the 
contrib folder where such features could be optionally included for 
Arrow.{color}


> Support optionally, and as an extension, an encoding layout for 
> text-optimized data structures
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4753
>                 URL: https://issues.apache.org/jira/browse/ARROW-4753
>             Project: Apache Arrow
>          Issue Type: Wish
>         Environment: C/C++
>            Reporter: Edmon Begoli
>            Priority: Minor
>              Labels: features
>
> Narrative text is, by default, notoriously inefficient to store on disk or in 
> memory. In its most basic form, it is a long sequence of bytes with no 
> indexing or other optimized layout structure. 
>   
>  There are data structures such as 
> [tries|https://en.wikipedia.org/wiki/Trie], 
> [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton],
>  or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more 
> efficient storage and lookup of phrases. 
>   
>  We would like to enable Arrow to serialize from/to these efficient 
> structures as the format/carrier between high-performance text processing 
> steps that operate on binary data structures (lookups, spellers, or more 
> advanced NLP routines).
>   
>  So, it could be something like:
>   
>  *{color:#707070}_text.to_arrow(infer=true|dafsa|trie|b-trie) : 
> arrow_{color}* {color:#14892c}// writes Arrow as the format for the specified 
> encoding. This could be implicit if we could store the encoding in some kind 
> of manifest{color}
>   
>  *{color:#707070}_arrow.to_text(infer=true|dafsa|trie|b-trie) : 
> string_{color}* {color:#14892c}// restores text from the Arrow format and 
> from a specified encoding, same as above.{color}
>   
>  {color:#333333}On the dev mailing list we are discussing the creation of a 
> contrib folder where such features could be optionally included for 
> Arrow.{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
