thisisnic commented on issue #46149: URL: https://github.com/apache/arrow/issues/46149#issuecomment-2834208001
Hi @r2evans, thanks for reporting this. It's a strange error, and it's odd that you get two different behaviours from the two functions. My initial guess is that, because `read_parquet()` reads the file into memory immediately whereas `open_dataset()` scans the path and only retrieves the data later, something weird is happening with the connection in between.

I'm not familiar with `fcntl`, so I asked ChatGPT. I don't know if this is accurate, but the gist of the answer is that `fcntl` is used to manipulate file descriptors, and `F_RDADVISE` is an optimisation that asks the OS to prefetch a range of the file (an offset and a length) into cache. Here, the version of sshfs you mention doesn't support `F_RDADVISE` properly, so the call fails and the whole read fails with it, when really what we should be doing is skipping the optimisation.

If this interpretation is correct, I'm unsure whether this should be fixed on the sshfs side, where `F_RDADVISE` isn't working as expected, or in arrow, by failing gracefully or skipping the optimisation. Pinging @pitrou to check whether any of this is accurate before I suggest concrete action here.

The section of the code that is the source of the error: https://github.com/apache/arrow/blob/6e84d990379a0fe20a0a89311aa864f40efded23/cpp/src/arrow/io/file.cc#L275-L282
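To make the "skip the optimisation" option concrete, here is a rough sketch of treating the read-ahead hint as best-effort. This is not Arrow's actual code: `AdviseWillNeed` and `ReadAtWithHint` are hypothetical names made up for illustration, and the `posix_fadvise` branch is only there so the sketch also builds on platforms without `F_RDADVISE`. The only idea it shows is that a failed hint gets logged and ignored, and the read itself still goes ahead.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not Arrow's API): ask the OS to prefetch
// [offset, offset + length) of `fd`. Returns true if the hint was accepted.
// A false return is not an error: reads still work, just without prefetch.
bool AdviseWillNeed(int fd, int64_t offset, int64_t length) {
  assert(offset >= 0 && length >= 0);
#if defined(F_RDADVISE)
  // macOS: the hint is passed to fcntl() as a struct radvisory.
  struct radvisory hint;
  hint.ra_offset = static_cast<off_t>(offset);
  // ra_count is an int, so clamp very large ranges.
  hint.ra_count = static_cast<int>(std::min<int64_t>(length, INT_MAX));
  const int err = (fcntl(fd, F_RDADVISE, &hint) == -1) ? errno : 0;
#elif defined(POSIX_FADV_WILLNEED)
  // Assumption: on other POSIX systems, posix_fadvise() is the closest
  // equivalent. It returns the error number directly instead of setting errno.
  const int err = posix_fadvise(fd, static_cast<off_t>(offset),
                                static_cast<off_t>(length), POSIX_FADV_WILLNEED);
#else
  (void)fd;
  const int err = ENOTSUP;  // no hint mechanism available
#endif
  if (err != 0) {
    // The filesystem (e.g. a FUSE mount) may not support the hint; skip it.
    std::fprintf(stderr, "read-ahead hint skipped: %s\n", std::strerror(err));
    return false;
  }
  return true;
}

// Hypothetical read path: issue the hint, ignore its outcome, then read.
ssize_t ReadAtWithHint(int fd, void* buf, size_t nbytes, int64_t offset) {
  (void)AdviseWillNeed(fd, offset, static_cast<int64_t>(nbytes));
  return pread(fd, buf, nbytes, static_cast<off_t>(offset));
}
```

With something like this, a filesystem that rejects the hint would cost only the prefetch, not the read. Whether that is the right behaviour for arrow is exactly the open question above.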