Antoine, fair point.  I just ran some performance measurements comparing
FileOutputStream against my growing mmap implementation.
It seems that in most cases you are correct: their runtimes are basically
equivalent.  The only time mmap beats it significantly is when there are many
Flush calls, i.e. when batches are small.  I have a parameter that controls
how many rows to buffer before finishing a record batch and writing it out.
Note that my mmap implementation currently doubles its size every time it's
asked to grow.

Writing 5 double columns over 10 million rows, I get the following:

MMAP:
BatchSize    Time (MM:SS.mmm)
1            01:24.849
10           00:08.980
100          00:02.105
1000         00:01.081
10000        00:01.101

FILE:
BatchSize    Time (MM:SS.mmm)
1            03:13.982
10           00:18.875
100          00:03.172
1000         00:01.137
10000        00:01.104

-----Original Message-----
From: Antoine Pitrou [mailto:anto...@python.org] 
Sent: Friday, May 11, 2018 4:54 AM
To: dev@arrow.apache.org
Subject: Re: Question about streaming to memorymapped files


If you write your own auto-growing memory mapped file implementation,
I'd be curious about performance measurements vs. FileOutputStream (and
possibly BufferedOutputStream).

mremap() and truncate() calls are not free.  Also, at some point you'll
want to unmap data already written to prevent the map from growing
endlessly.

Regards

Antoine.


On 09/05/2018 at 17:55, Ambalu, Robert wrote:
> I don’t use the output stream objects directly though, right? Just to take a 
> step back a bit, what I'm trying to do is to stream generated rows to a 
> table in real time (with the ability to control how many rows to batch up 
> before writing out a record batch).
> 
> My understanding is that to properly stream table data I need to:
> a) create an output stream instance
> b) create a RecordBatchStreamWriter, binding my stream object to it
> c) create a RecordBatchBuilder.  As rows are added, add them to the record 
> batch builder.  When we're ready to flush, call Flush on the batch builder to 
> create a record batch and pass the batch to the RecordBatchStreamWriter.
> 
> I was hoping to use MemoryMappedFile for (a), but since it doesn’t support 
> dynamically growing the mmap file I'll have to write my own implementation.
> 
> -----Original Message-----
> From: Antoine Pitrou [mailto:anto...@python.org] 
> Sent: Wednesday, May 09, 2018 11:42 AM
> To: dev@arrow.apache.org
> Subject: Re: Question about streaming to memorymapped files
> 
> 
> As for buffering data before making a call to write(): in Arrow 0.10.0
> you'll be able to use BufferedOutputStream for this:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.h
> 
> Regards
> 
> Antoine.
> 
> 
> On 09/05/2018 at 17:39, Ambalu, Robert wrote:
>> I don’t have any offhand, no, but I would imagine that direct file writes 
>> will at some point need to make a system call, which is expensive (fwrite 
>> might buffer before eventually making the system call, but it looks like 
>> FileOutputStream uses the raw system write for every write call).
>> The current mmap I/O interface isn’t usable as a streaming output, 
>> unfortunately, though I suppose I could just implement my own.
>>
>> -----Original Message-----
>> From: Antoine Pitrou [mailto:solip...@pitrou.net] 
>> Sent: Wednesday, May 09, 2018 11:11 AM
>> To: dev@arrow.apache.org
>> Subject: Re: Question about streaming to memorymapped files
>>
>>
>> Do you know of any benchmark numbers / performance studies about this?
>> While it's true that a memory-mapped file avoids explicit system calls,
>> I've heard file I/O is quite well optimized, at least on Linux,
>> nowadays.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Wed, 9 May 2018 14:47:53 +0000
>> "Ambalu, Robert" <robert.amb...@point72.com> wrote:
>>> Antoine, thanks for the quick reply.
>>> You can actually grow memory-mapped files with an mremap call (and I think a 
>>> seek/write on the file); I do this in my applications and it works fine.
>>> I want the efficiency of writing via memory maps, so I would prefer to avoid 
>>> FileOutputStream.
>>>
>>> -----Original Message-----
>>> From: Antoine Pitrou [mailto:anto...@python.org] 
>>> Sent: Wednesday, May 09, 2018 10:37 AM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Question about streaming to memorymapped files
>>>
>>>
>>> Hi,
>>>
>>> If you don't know the output size upfront then you should probably use a
>>> FileOutputStream instead.  By definition, memory mapped files must have
>>> a fixed size (since they are mapped to a fixed area in virtual memory).
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 09/05/2018 at 16:31, Ambalu, Robert wrote:
>>>> Hey, I'm looking into streaming table updates into a memory-mapped file 
>>>> (C++).
>>>> I think I have everything I need (MemoryMappedFile output streamer, 
>>>> RecordBatchStreamWriter), but I don't understand how to properly create 
>>>> the memory-mapped file.  It looks like it requires you to preset a size for 
>>>> the file when you create it, but since I'll be streaming I don't actually 
>>>> know how big a file I'm going to need...
>>>> Am I missing some other API point here?  Any reason why the size is 
>>>> required up front and the memory map doesn't auto-grow as needed?
>>>>
>>>> Thanks in advance
>>>> - Rob
