[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393337986
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   this file has a fair amount of test code.. any way to simply/reuse? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393290627
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
+);
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String 
outerScheme) {
+String scheme = inlinePath.getParent().getParent().getName();
+Path basePath = inlinePath.getParent().getParent().getParent();
+return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20=40".
+   * output: 20
+   *
+   * @param inlinePath
+   * @return
+   */
+  public static int startOffset(Path inlinePath) {
+String pathName = inlinePath.getName();
+return Integer.parseInt(pathName.substring(pathName.indexOf('=') + 1, 
pathName.indexOf('&')));
 
 Review comment:
   can we implement this using `split()` and then picking the ith index.. 
instead of relying on first and last? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338790
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestParquetReadWriteFlow.java
 ##
 @@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.avro.AvroParquetWriter;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} with Parquet writer and reader.
+ */
+public class TestParquetReadWriteFlow {
 
 Review comment:
   rename `TestParquetInlining`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393337020
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+inMemoryConf = new Configuration();
+inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", 
InMemoryFileSystem.class.getName());
+inlineConf = new Configuration();
+inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", 
InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+if (generatedPath != null) {
+  File filePath = new 
File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 
1));
+  if (filePath.exists()) {
+FileSystemTestUtils.deleteFile(filePath);
+  }
+}
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+Path outerInMemFSPath = getRandomOuterInMemPath();
+Path outerPath = new Path(FILE_SCHEME + 
outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+generatedPath = outerPath;
+CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+HFileContext meta = new HFileContextBuilder()
+.withBlockSize(minBlockSize)
+.build();
+HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+.withOutputStream(fout)
+.withFileContext(meta)
+.withComparator(new KeyValue.KVComparator())
+.create();
+
+writeRecords(writer);
+fout.close();
+
+byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+long inlineLength = inlineBytes.length;
+
+// Generate phantom inline file
+Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+InlineFileSystem inlineFileSystem = (InlineFileSystem) 
inlinePath.getFileSystem(inlineConf);
+FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, 
cacheConf, inlineConf);
+// Load up the index.
+reader.loadFileInfo();
+// Get a scanner that caches and that does not use pread.
+HFileScanner scanner = 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338152
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   rename to `TestHFileInlining`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338934
 
 

 ##
 File path: pom.xml
 ##
 @@ -684,6 +684,11 @@
 hbase-client
 ${hbase.version}
   
+  
+org.apache.hbase
+hbase-server
+${hbase.version}
 
 Review comment:
   scope to be test? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336968
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+inMemoryConf = new Configuration();
+inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", 
InMemoryFileSystem.class.getName());
+inlineConf = new Configuration();
+inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", 
InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+if (generatedPath != null) {
+  File filePath = new 
File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 
1));
+  if (filePath.exists()) {
+FileSystemTestUtils.deleteFile(filePath);
+  }
+}
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+Path outerInMemFSPath = getRandomOuterInMemPath();
+Path outerPath = new Path(FILE_SCHEME + 
outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+generatedPath = outerPath;
+CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+HFileContext meta = new HFileContextBuilder()
+.withBlockSize(minBlockSize)
+.build();
+HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+.withOutputStream(fout)
+.withFileContext(meta)
+.withComparator(new KeyValue.KVComparator())
+.create();
+
+writeRecords(writer);
+fout.close();
+
+byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+long inlineLength = inlineBytes.length;
+
+// Generate phantom inline file
+Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+InlineFileSystem inlineFileSystem = (InlineFileSystem) 
inlinePath.getFileSystem(inlineConf);
+FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, 
cacheConf, inlineConf);
+// Load up the index.
+reader.loadFileInfo();
+// Get a scanner that caches and that does not use pread.
+HFileScanner scanner = 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393291524
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
 
 Review comment:
   In any case, please file JIRA for these gaps after we land this 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393291311
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
 
 Review comment:
   can you expand on this? supporting parquet summary metadata is a critical 
thing for the PR IMO


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393289965
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
+);
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String 
outerScheme) {
 
 Review comment:
   above example does not make it easy to understand what `outerScheme` is? is 
n't `outerScheme` always `inlinefs` if this is an inlinePath? Should we need to 
pass it in? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336202
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+super.initialize(name, conf);
+this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws 
IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
+FileSystem outerFs = outerPath.getFileSystem(conf);
+FSDataInputStream outerStream = outerFs.open(outerPath, bufferSize);
+return new InlineFsDataInputStream(InLineFSUtils.startOffset(inlinePath), 
outerStream, InLineFSUtils.length(inlinePath));
+  }
+
+  @Override
+  public boolean exists(Path f) {
+try {
+  return getFileStatus(f) != null;
+} catch (Exception e) {
+  return false;
+}
+  }
+
+  @Override
+  public FileStatus getFileStatus(Path inlinePath) throws IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
+FileSystem outerFs = outerPath.getFileSystem(conf);
+FileStatus status = outerFs.getFileStatus(outerPath);
+FileStatus toReturn = new FileStatus(InLineFSUtils.length(inlinePath), 
status.isDirectory(), status.getReplication(), status.getBlockSize(),
 
 Review comment:
   and all of this code will read nicely.. as just getters on that pojo 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393286954
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
 
 Review comment:
   ensure that `toString()` will always yield something with the scheme? may be 
there is a more direct method for this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393288343
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
 
 Review comment:
   whats the purpose of this string? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393292323
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
 
 Review comment:
   I think the InlineFSUtils can assume this scheme and can become lot simpler 
to read, with fewer args in methods. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393288676
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
 
 Review comment:
   break to a new line at `"&"` for readability? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336002
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+super.initialize(name, conf);
+this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws 
IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
 
 Review comment:
   can we introduce a pojo for this? Something that can keep an outerPath, 
scheme, startOffset and length together.. we can have this utils method return 
that.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393286265
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
 
 Review comment:
   rename: getInlineFilePath()? (Prefer Inline to InLine everywhere for 
camelcasing it) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393285575
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
 
 Review comment:
   may be a better example on the second line? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393289484
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
+);
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String 
outerScheme) {
+String scheme = inlinePath.getParent().getParent().getName();
+Path basePath = inlinePath.getParent().getParent().getParent();
+return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
 
 Review comment:
   Similarly above, could we have just replaced `file:` with `inlinefs`, 
instead of indexOf(). 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-16 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393287633
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: 
"inlinefs:"
+ * 
"inlinefs:///inline_file/?start_offset=start_offset>="
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
 
 Review comment:
   use a different scheme like hdfs: or s3a for illustration? :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-01 Thread GitBox
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r386132436
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InMemoryFileSystem.java
 ##
 @@ -0,0 +1,120 @@
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.ByteArrayOutputStream;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+/**
+ * A FileSystem which stores all content in memory and returns a byte[] when 
{@link #getFileAsBytes()} is called
+ * This FileSystem is used only in write path. Does not support any read apis 
except {@link #getFileAsBytes()}.
+ */
+public class InMemoryFileSystem extends FileSystem {
 
 Review comment:
   can we move all of this code to `hudi-common` 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services