[ 
https://issues.apache.org/jira/browse/ARROW-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430029#comment-15430029
 ] 

Micah Kornfield commented on ARROW-263:
---------------------------------------

Regarding the unix domain sockets.  Yes I think you are right that there needs 
to be a centralized named file to so that communication can be bootstrapped. 

There are two types of page faults.  Minor and Major.  Minor ones are when the 
pages are in memory but not mapped properly the address space of the new 
process yet and major ones are when the page isn't in memory at all and needs 
to be loaded mapped [1].  If I understood Todd Lipcon on the thread mentioned 
above when minor page faults are accounted for, the latency of shared memory 
might not be better then just transferring over unix domain sockets.

I think I have a workable design for IPC now in my head, but not sure that I 
will get it documented tonight, and will likely  be Away from keyboard for at 
least a couple of days.  

[1] https://en.wikipedia.org/wiki/Page_fault#Minor

> Design an initial IPC mechanism for Arrow Vectors
> -------------------------------------------------
>
>                 Key: ARROW-263
>                 URL: https://issues.apache.org/jira/browse/ARROW-263
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>
> Prior discussion on this topic [1].
> Use-cases:
> 1.  User defined function (UDF) execution:  One process wants to execute a 
> user defined function written in another language (e.g. Java executing a 
> function defined in python, this involves creating Arrow Arrays in java, 
> sending them to python and receiving a new set of Arrow Arrays produced in 
> python back in the java process).
> 2.  If a storage system and a query engine are running on the same host we 
> might want use IPC instead of RPC (e.g. Apache Drill querying Apache Kudu)
> Assumptions:
> 1.  IPC mechanism should be useable from the core set of supported languages 
> (Java, Python, C) on POSIX and ideally windows systems.  Ideally, we would 
> not need to add dependencies on additional libraries outside of each 
> languages outside of this document.
> We want leverage shared memory for Arrays to avoid doubling RAM requirements 
> by duplicating the same Array in different memory locations.  
> 2. Under some circumstances shared memory might be more efficient than FIFOs 
> or sockets (in other scenarios they won’t see thread below).
> 3. Security is not a concern for V1, we assume all processes running are 
> “trusted”.
> Requirements:
> 1.Resource management: 
>     a.  Both processes need a way of allocating memory for Arrow Arrays so 
> that data can be passed from one process to another.
>     b. There must be a mechanism to cleanup unused Arrow Arrays to limit 
> resource usage but avoid race conditions when processing arrays
> 2.  Schema negotiation - before sending data, both processes need to agree on 
> schema each one will produce.
> Out of scope requirements:
> 1.  IPC channel metadata discovery is out of scope of this document.  
> Discovery can be provided by passing appropriate command line arguments, 
> configuration files or other mechanisms like RPC (in which case RPC channel 
> discovery is still an issue).
> [1] 
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c8d5f7e3237b3ed47b84cf187bb17b666148e7...@shsmsx103.ccr.corp.intel.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to