xu2011 commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1455341709

   I ran into a similar issue when reading from S3 with pd.read_parquet. Read 
performance is slow compared to pq.read_table().to_pandas(), even though 
pandas.read_parquet uses the same function call underneath.
   
   I found that the slowdown may be caused by 
[_ensure_filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2431)
 in the _ParquetDataset class, which [checks the object type of the given 
filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L99).
 When that check fails, it wraps the filesystem in a 
[PyFileSystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L119),
 which slows down the read.
   
   With pq.read_table().to_pandas(), the filesystem is instead [parsed from the S3 
path](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2459),
 which yields a `pyarrow._s3fs.S3FileSystem`.
   
   
   
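   ```
   import s3fs
   from pyarrow.fs import FileSystem, _ensure_filesystem

   # Hypothetical placeholder: the path is not given in this comment.
   s3_path = "s3://bucket/path/to/file.parquet"


   def _ensure_filesystem_checkinstance():
       # An s3fs filesystem is not a pyarrow FileSystem instance.
       s3 = s3fs.S3FileSystem()
       print(isinstance(s3, FileSystem))


   def fs_pandas():
       # pd.read_parquet case: the s3fs object goes through _ensure_filesystem.
       s3 = s3fs.S3FileSystem()
       fs = _ensure_filesystem(s3)
       print(fs)


   def fs_pq():
       # pq.read_table case: the filesystem is inferred from the S3 URI.
       filesystem, path_or_paths = FileSystem.from_uri(s3_path)
       print(filesystem)


   _ensure_filesystem_checkinstance()
   fs_pandas()
   fs_pq()
   ```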
   Result
   ```
   _ensure_filesystem_checkinstance()
   False
   
   fs_pandas()
   pyarrow._fs.FileSystem
   
   fs_pq()
   pyarrow._s3fs.S3FileSystem
   ```
   
   I don't think this is expected behavior, and I suggest reopening the issue.

