[ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108299#comment-17108299 ]
Jörn Horstmann commented on ARROW-7574:
---------------------------------------
I retested this with the current master and it does indeed seem to be fixed.
There are still seeks issued when the file position should already be correct,
but one such seek per 8192 bytes read should not be a problem.
{code:java}
lseek(5, 4, SEEK_SET) = 4
read(5, "\25\0\25\260\200\200\1\25\272\354\37,\25\234\263\6\25\0\25\10\25\10\0346\0(\02000000"..., 8192) = 8192
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f66eca90000
lseek(5, 8196, SEEK_SET) = 8196
read(5, "J N\0001J N\00416J\320\7\0006J\240\17\0006J\320\7\0006J\320\7\0006J"..., 252765) = 252765
{code}
> [Rust] FileSource read implementation is seeking for each single byte
> ---------------------------------------------------------------------
>
> Key: ARROW-7574
> URL: https://issues.apache.org/jira/browse/ARROW-7574
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Affects Versions: 0.16.0
> Reporter: Jörn Horstmann
> Priority: Major
>
> On the current master branch:
> {code:java}
> $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
> ...
> lseek(3, -8, SEEK_END) = 2937
> read(3, ",\10\0\0PAR1", 8192) = 8
> lseek(3, 845, SEEK_SET) = 845
> read(3, "\25\2\31\334H schema"..., 8192) = 2100
> ...
> lseek(5, 4, SEEK_SET) = 4
> read(5, "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000"..., 8192) = 2941
> lseek(5, 5, SEEK_SET) = 5
> read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000"..., 8192) = 2940
> lseek(5, 6, SEEK_SET) = 6
> read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000"..., 8192) = 2939
> lseek(5, 7, SEEK_SET) = 7
> read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000000"..., 8192) = 2938
> lseek(5, 8, SEEK_SET) = 8
> read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000000"..., 8192) = 2937
> lseek(5, 9, SEEK_SET) = 9
> read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004"..., 8192) = 2936
> lseek(5, 10, SEEK_SET) = 10
> read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004\30"..., 8192) = 2935
> {code}
> Notice that the seek position is incremented by one each time, even though
> each read requests up to 8192 bytes. Interestingly, this does not seem to have
> a big performance impact on a local file system on Linux, but it becomes a
> problem when working with a custom implementation of ParquetReader, for
> example one that reads from S3.
> The problem seems to be in
> {code}
> impl<R: ParquetReader> Read for FileSource<R>
> {code}
> which is unconditionally calling
> {code}
> reader.seek(SeekFrom::Start(self.start as u64))?
> {code}
> Instead it should probably keep track of the current position and only seek
> on the first read.
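
For illustration, the suggestion above (track the current position and seek only on the first read) could look roughly like the sketch below. This is not the actual Arrow code: TrackedSource and CountingReader are hypothetical names made up for this example, and the sketch assumes nothing else moves the underlying reader's position between reads. If the reader is shared, a seek is still needed whenever the position may have changed.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for the read side of `FileSource` (not the Arrow
/// type): reads the byte range `start..start + length` of `inner` and seeks
/// only once, before the first read.
pub struct TrackedSource<R: Read + Seek> {
    inner: R,
    pos: u64,         // absolute offset of the next byte to read
    end: u64,         // absolute offset one past the last readable byte
    positioned: bool, // true once `inner` has been seeked to `pos`
}

impl<R: Read + Seek> TrackedSource<R> {
    pub fn new(inner: R, start: u64, length: u64) -> Self {
        TrackedSource { inner, pos: start, end: start + length, positioned: false }
    }
}

impl<R: Read + Seek> Read for TrackedSource<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Assumption: nobody else moves `inner` between our reads, so a
        // single seek before the first read is sufficient.
        if !self.positioned {
            self.inner.seek(SeekFrom::Start(self.pos))?;
            self.positioned = true;
        }
        // Never read past the end of the configured range.
        let remaining = self.end - self.pos;
        let want = remaining.min(buf.len() as u64) as usize;
        let n = self.inner.read(&mut buf[..want])?;
        self.pos += n as u64;
        Ok(n)
    }
}

/// Helper for experiments: counts how often `seek` is called on `inner`.
pub struct CountingReader<R> {
    pub inner: R,
    pub seeks: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        self.seeks += 1;
        self.inner.seek(target)
    }
}
```

Wrapping a std::io::Cursor in CountingReader and reading a range through TrackedSource in small chunks should show a single seek, where the strace output in the issue shows one lseek per read call.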