samredai commented on a change in pull request #4021:
URL: https://github.com/apache/iceberg/pull/4021#discussion_r798082837



##########
File path: python/src/iceberg/io/base.py
##########
@@ -24,7 +24,40 @@
 """
 
 from abc import ABC, abstractmethod
-from typing import Union
+from typing import Protocol, Union, runtime_checkable
+
+
+@runtime_checkable
+class InputStream(Protocol):
+    def read(self, n: int) -> bytes:
+        ...

Review comment:
       These are the equivalent of 
[SeekableInputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/seekable_input_stream.py#L19)
 and 
[PositionOutputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/position_output_stream.py#L19)
 in the legacy python client. It's just less strict here in that any class with 
the same structure is automatically a subclass.
   
   I think this highlights the behavior:
   ```py
   from typing import Protocol, runtime_checkable
   
   @runtime_checkable
   class A(Protocol):
       def foo(self):
           ...
       def bar(self):
           ...
   
   class B:
       def foo(self):
           ...
       def bar(self):
           ...
   
   class C:
       def foo(self):
           ...
       def bar(self):
           ...
       def baz(self):
           ...
   
   class D:
       def foo(self):
           ...
   
   isinstance(B(), A)  # True
   isinstance(C(), A)  # True
   isinstance(D(), A)  # False
   ```
   
   So an input file or output file class could have more methods, as long as it 
has the required methods set in the protocol which should be the case for any 
file-like object that's based on 
[RawIOBase](https://docs.python.org/3/library/io.html#io.RawIOBase) (inherits 
from IOBase and adds a `read`, `readall`, `readinto`, and `write` method).
   
   The reason I renamed a few of the methods from the legacy client was also to 
be in-line with `RawIOBase`. Here's an example of how urllibs HTTPResponse 
object matches this protocol:
   ```py
   import urllib.request
   
   response = urllib.request.urlopen('http://python.org/')
   isinstance(response, InputStream)  # True
   
   print(dir(response))
   ```
   output:
   ```
   [...
    'begin',
    'chunk_left',
    'chunked',
    'close',
    'closed',
    'code',
    'debuglevel',
    'detach',
    'fileno',
    'flush',
    'fp',
    'getcode',
    'getheader',
    'getheaders',
    'geturl',
    'headers',
    'info',
    'isatty',
    'isclosed',
    'length',
    'msg',
    'peek',
    'read',
    'read1',
    'readable',
    'readinto',
    'readinto1',
    'readline',
    'readlines',
    'reason',
    'seek',
    'seekable',
    'status',
    'tell',
    'truncate',
    'url',
    'version',
    'will_close',
    'writable',
    'write',
    'writelines']
   ```
   
   One of the remaining questions though is where do we set this `isinstance` 
check, or do we do it at all? The rest of the library will be assuming this 
interface so do we leave it to every `InputFile` and `OutputFile` 
implementations to do an `isinstance` check? Or do we do the `isinstance` check 
in other parts of the library wherever `FileIO.new_input_file(...)` is called, 
for example. This feels like it might be a tradeoff to this approach in 
addition to lack of python 3.7 support.
   
   (sorry for the super long comment)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to