arthursunbao commented on issue #10885: URL: https://github.com/apache/arrow/issues/10885#issuecomment-894037170
Hi westonpace,

Thanks for your quick response. Our scenario is as follows: we have a recommendation system and want to transfer user data from Kafka and Hive to an online Redis-like storage. We found that Arrow has good columnar storage capabilities and can deserialize data without parsing the entire payload the way Protobuf does, so we use Arrow (via `ArrowStreamWriter`) to serialize and compress the user data into binary in Kafka and Hive.

However, in our scenario the data schema is different for every user, so we want to keep an IDL schema for each user in an independent management system. That way, once the serialized data is in Redis, a third-party system that loads the Arrow-serialized data from Redis can look up the user's unique schema and deserialize the binary data using `ArrowFileReader`.

We dug into the Arrow Java API and found that we first need code like this:

```java
RootAllocator allocator = new RootAllocator();
VectorSchemaRoot schemaRoot = VectorSchemaRoot.create(UserSchema.schema(), allocator);
FileOutputStream fileOutputStream = new FileOutputStream(FILE_PATH);
ArrowFileWriter arrowFileWriter = new ArrowFileWriter(schemaRoot, null, fileOutputStream.getChannel());
```

So basically, if the user uses the Java SDK, he needs to keep a `UserSchema.schema()` Java file. So what if the user wants to use the C++ SDK to read the schema? Does that mean he needs to keep a C++ struct as well?

Thanks in advance,
Jason
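For what it's worth, a minimal sketch of the read side in Java, assuming a file written with `ArrowFileWriter` as above (the file path and class name here are hypothetical). The Arrow IPC file format stores the schema in the file's footer, so the reader recovers it from the file itself rather than from a compiled-in definition:

```java
import java.io.File;
import java.io.FileInputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.SeekableReadChannel;
import org.apache.arrow.vector.types.pojo.Schema;

public class ReadUserData {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a file produced by ArrowFileWriter.
        File file = new File("/tmp/user-data.arrow");
        try (RootAllocator allocator = new RootAllocator();
             FileInputStream in = new FileInputStream(file);
             ArrowFileReader reader = new ArrowFileReader(
                     new SeekableReadChannel(in.getChannel()), allocator)) {
            // The schema comes from the file footer; no UserSchema.schema()
            // equivalent is needed on the reading side.
            Schema schema = reader.getVectorSchemaRoot().getSchema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                System.out.println("rows: " + root.getRowCount());
            }
        }
    }
}
```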
