Thanks, this makes it much clearer to me.

There is some further discussion of custom SIGSEGV handling at
https://stackoverflow.com/questions/24351359/mmap-for-remote-file.
However, I will likely not go down that rabbit hole. :)
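
For anyone curious what that rabbit hole looks like, here is a rough sketch of the approach Kenton describes below, assuming Linux and a single-threaded program. The names map_remote, fetch_page and on_segv are my own placeholders, and the "remote file" is just a local buffer. Instead of mapping a fetched page into place, this variant copies into the reserved region and flips its protection with mprotect(). Note that mprotect() and memcpy() are not formally async-signal-safe, so this is an illustration only.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// All names below are hypothetical; g_remote stands in for the remote file.
const unsigned char* g_remote = nullptr;
unsigned char* g_base = nullptr;      // start of the reserved, unreadable region
size_t g_size = 0;                    // size of the region in bytes
size_t g_page = 0;                    // system page size
volatile sig_atomic_t g_faults = 0;   // number of pages fetched so far

// Placeholder fetch: a real version would request this page over the network.
void fetch_page(size_t offset, void* dst) {
  memcpy(dst, g_remote + offset, g_page);
}

// SIGSEGV handler: fill in the page that was touched, then let the read retry.
void on_segv(int, siginfo_t* info, void*) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
  uintptr_t base = reinterpret_cast<uintptr_t>(g_base);
  if (addr < base || addr >= base + g_size) {
    // Not our region: restore the default action so the retry crashes normally.
    signal(SIGSEGV, SIG_DFL);
    return;
  }
  size_t offset = (addr - base) & ~(g_page - 1);  // round down to page start
  void* page = g_base + offset;
  mprotect(page, g_page, PROT_READ | PROT_WRITE);  // writable while we fill it
  fetch_page(offset, page);
  mprotect(page, g_page, PROT_READ);               // readable from now on
  g_faults = g_faults + 1;
}

// Reserves an unreadable region of `size` bytes and installs the handler.
// `size` must be a multiple of the page size. Returns nullptr on failure.
const unsigned char* map_remote(const unsigned char* remote, size_t size) {
  g_page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  assert(size % g_page == 0);
  g_remote = remote;
  g_size = size;
  void* p = mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  g_base = static_cast<unsigned char*>(p);

  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = on_segv;
  sa.sa_flags = SA_SIGINFO;  // needed to receive si_addr in the handler
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGSEGV, &sa, nullptr) != 0) return nullptr;
  return g_base;
}
```

Reading any byte through the pointer returned by map_remote() triggers one fetch for the page containing it, and later reads of the same page go straight to memory.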

I will instead consider the recommendation to split the problem up into
multiple messages with externally handled framing/indexing.
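
To make that concrete for myself, this is roughly the layout I have in mind, as a sketch only. ChunkRef, append_chunk and fetch_range are names I made up, the strings stand in for separately serialized messages (one per r-tree node), and fetch_range is a placeholder for whatever ranged read the remote store offers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One index entry per chunk (hypothetical layout): where the chunk starts in
// the container and how many bytes it occupies.
struct ChunkRef {
  uint64_t offset;
  uint64_t length;
};

// Appends one serialized chunk to the container and returns its index entry.
ChunkRef append_chunk(std::string& container, const std::string& chunk) {
  ChunkRef ref{container.size(), chunk.size()};
  container += chunk;
  return ref;
}

// Placeholder for a ranged read. Here it slices a local string; a real
// version would fetch only these bytes from the remote store.
std::string fetch_range(const std::string& container, ChunkRef ref) {
  assert(ref.offset + ref.length <= container.size());
  return container.substr(ref.offset, ref.length);
}
```

The index (a std::vector<ChunkRef> keyed by node number, say) is small enough to download up front, and each query then fetches only the byte ranges for the nodes it hits.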

On Thursday, 16 August 2018 at 20:32:10 UTC+2, Kenton Varda wrote:
>
> Hi Björn,
>
> The easiest way to make this work would be to have the data set live on a 
> network filesystem (e.g. NFS) or block device (e.g. NBD, iSCSI) which you 
> can mount on your local system and then use mmap().
>
> If mounting a remote filesystem is not an option, it is technically 
> possible to do everything in userspace instead -- but it's tricky. 
> Essentially, you can implement a memory mapping entirely in userspace by 
> writing your own signal handler for SIGSEGV. At startup, you would create 
> an anonymous memory mapping that is at least the size of your remote file, 
> and is marked to prohibit reading. When your program attempts to read from 
> this space, a SIGSEGV signal is raised. In your signal handler, you look at 
> what address the code was trying to access (from si_addr in the siginfo_t), 
> you fetch the appropriate page from the remote server, you map that page 
> into the right place in local memory, and then you mark it as readable. On 
> return from the signal handler, the code continues on with the newly-mapped 
> data.
>
> This is, of course, pretty advanced systems hacking, and unfortunately I 
> don't know of a library that does it for you (though I bet one exists... 
> somewhere).
>
> Otherwise, you need to split your data into smaller pieces that your 
> application knows how to fetch explicitly as needed...
>
> -Kenton
>
> On Thu, Aug 16, 2018 at 10:32 AM, <[email protected]> wrote:
>
>> Hi,
>>
>> I'm investigating using Cap'n Proto as the basis for a format containing 
>> a large collection of r-tree indexed data. The typical access pattern would 
>> be to query the index resulting in a set of nodes in the tree. The 
>> collection of data would be physically clustered on node indices so that 
>> one can efficiently seek and read the data items for the searched node 
>> indexes.
>>
>> The recommendation for random access has been to simply use mmap, which I 
>> assume would work well in this case but AFAIK it's something that is only 
>> used for files readily available on attached block storage. However, in 
>> this case the full dataset might very well be too large to keep locally 
>> and the preferred access method would be streaming access over network with 
>> the same pattern of random access using index searches.
>>
>> I'm a C++ novice and I fail to understand if something remotely like this 
>> can be done already with the reference C++ implementation. Indeed, I have 
>> not even been able to understand if it supports sequential streaming access 
>> of a part of a message - it seems assumed that a message is fully read into 
>> RAM, except when using mmap which would then be the only way to partially 
>> read a message (sequential or random). But I do not want to give up yet, 
>> perhaps there is something I'm missing?
>>
>> Regards,
>>
>> Björn
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Cap'n Proto" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> Visit this group at https://groups.google.com/group/capnproto.
>>
>
>
