[GitHub] [arrow] kou commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian

GitBox Mon, 10 Aug 2020 14:34:09 -0700


kou commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r468193502




##########
File path: cpp/src/arrow/ipc/options.h
##########
@@ -94,6 +94,11 @@ struct ARROW_EXPORT IpcReadOptions {
   /// like decompression
   bool use_threads = true;
 
+  /// \brief Convert endian of data element to platform-native endianness
+  /// if the endianness of the received schema is not equal to
+  /// platform-native endianness
+  bool use_native_endian = true;

Review comment:
       Sorry. I suggested the `use_native_endian` name but it's not a good 
name...
   It looks like that "if use_native_endian is false, the read data uses non 
native endian".
   
   How about renaming it to `ensure_native_endian` or something?

##########
File path: cpp/src/arrow/type.h
##########
@@ -1582,13 +1583,26 @@ class ARROW_EXPORT FieldRef {
 // ----------------------------------------------------------------------
 // Schema
 
+enum class Endianness {
+  LITTLE = 0,
+  BIG = 1,
+#if ARROW_LITTLE_ENDIAN
+  NATIVE = LITTLE
+#else
+  NATIVE = BIG
+#endif
+};
+
 /// \class Schema
 /// \brief Sequence of arrow::Field objects describing the columns of a record
 /// batch or table data structure
 class ARROW_EXPORT Schema : public detail::Fingerprintable,
                             public util::EqualityComparable<Schema>,
                             public util::ToStringOstreamable<Schema> {
  public:
+  explicit Schema(std::vector<std::shared_ptr<Field>> fields, Endianness 
endianness,
+                  std::shared_ptr<const KeyValueMetadata> metadata = NULLPTR);

Review comment:
       Here are use cases I imagine:
   
     * Router: It doesn't process any Arrow data and just passes the received 
Arrow data to another system. In this case,  it's efficient that we allow 
non-native endianness in memory.
     * Saver: It doesn't process any Arrow data and just saves the received 
Arrow data to filesystem. For example, it saves the received Arrow data to S3 
as is.
   
   Does it make sense?

##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -449,7 +449,7 @@ Result<std::shared_ptr<RecordBatch>> LoadRecordBatchSubset(
     const flatbuf::RecordBatch* metadata, const std::shared_ptr<Schema>& 
schema,
     const std::vector<bool>& inclusion_mask, const DictionaryMemo* 
dictionary_memo,
     const IpcReadOptions& options, MetadataVersion metadata_version,
-    Compression::type compression, io::RandomAccessFile* file) {
+    Compression::type compression, io::RandomAccessFile* file, bool 
swap_endian = false) {

Review comment:
       I see.
   How about adding a new simple internal struct (`struct IpcReadContext`?) for 
organizing options (`IpcReadOptions`, `DictionaryMemo` and `swap_endian` for 
now)?
   
   I think that many arguments API isn't good to maintain.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] kou commented on a change in pull request #7507: ARROW-8797: [C++] Read RecordBatch in a different endian

Reply via email to