[jira] [Created] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.
Vivek Nadkarni created AVRO-1088:
------------------------------------

             Summary: Avro-C - Add performance tests for schema resolution and arrays.
                 Key: AVRO-1088
                 URL: https://issues.apache.org/jira/browse/AVRO-1088
             Project: Avro
          Issue Type: Improvement
          Components: c
    Affects Versions: 1.7.0
         Environment: Ubuntu Linux 11.10
            Reporter: Vivek Nadkarni
             Fix For: 1.7.0

The current performance test in Avro-C measures performance while reading and writing Avro values using a complex record schema, which does not contain any arrays. We add tests to measure performance for simple and nested arrays, and we replicate all tests to measure the performance of schema resolution using a resolved reader and a resolved writer. Specifically, we add the following performance tests:

Nested Record
1. Replicate the nested-record value-by-index test using a helper function. Helper functions add a little overhead, but they make it much easier to test various schemas as well as different modes of schema resolution.
2. Use a resolved writer to resolve between (identical) reader and writer schemas while reading a complex record.
3. Use a resolved reader to resolve between (identical) reader and writer schemas while writing a complex record.

Simple Array
4. Test the performance of reading and writing a simple array.
5. Use a resolved writer to resolve between (identical) reader and writer schemas while reading a simple array.
6. Use a resolved reader to resolve between (identical) reader and writer schemas while writing a simple array.

Nested Array
7. Test the performance of reading and writing a nested array.
8. Use a resolved writer to resolve between (identical) reader and writer schemas while reading a nested array.
9. Use a resolved reader to resolve between (identical) reader and writer schemas while writing a nested array.

Additionally, we fix a minor bug: the return value of avro_value_equal_fast() was not being tested. Test this return value, and fail if it is FALSE.
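The reported figures follow a fixed methodology: each test body is executed N times per run, for three runs, and the average run time is converted into a tests-per-second rate. A minimal sketch of that loop follows (in Java for illustration only; the actual suite is Avro-C code, and runTest() is a hypothetical stand-in for the value read/write body being measured):

```java
// Illustrative Java sketch of the benchmark loop behind the reported
// "N tests per run / Run 1..3 / Average time / Tests/sec" figures.
// The real suite is Avro-C code; runTest() is a hypothetical stand-in
// for the value read/write body being measured.
public class PerfHarness {
    static void runTest() {
        // placeholder workload standing in for reading/writing Avro values
        long acc = 0;
        for (int i = 0; i < 1000; i++) acc += i;
        if (acc != 499500) throw new IllegalStateException("unexpected result");
    }

    /** Runs `runs` runs of `testsPerRun` tests each; returns tests per second. */
    public static double measure(int testsPerRun, int runs) {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++)
            for (int t = 0; t < testsPerRun; t++)
                runTest();
        double avgRunSeconds = (System.nanoTime() - start) / 1e9 / runs;
        return testsPerRun / avgRunSeconds; // tests completed per second
    }

    public static void main(String[] args) {
        System.out.printf("Tests/sec: %.0f%n", measure(100, 3));
    }
}
```

Comparing the matched-schema rate against the resolved-writer and resolved-reader rates for the same schema then isolates the cost of schema resolution itself.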
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.
[ https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Nadkarni updated AVRO-1088:
---------------------------------

           Attachment: AVRO-1088.patch
    Original Estimate: 24h
   Remaining Estimate: 24h

Uploading patch file implementing the new performance tests.
[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.
[ https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Nadkarni updated AVRO-1088:
---------------------------------

    Status: Patch Available  (was: Open)

I ran the performance tests and got the results appended below. As expected, there is a slight performance hit for using a resolved writer or resolved reader with the complex record, compared to using matched schemas. However, the results also show that for the simple array and the nested array, the penalty for using the resolved writer is substantial: it takes 30 to 50 times longer than using no schema resolution or using the resolved reader. These results indicate a likely bug in the resolved writer when it resolves simple or nested arrays; that bug will be reported in a separate JIRA issue.

Each test was run three times (Run 1, Run 2, Run 3); the averages are:

Test                                             Tests/run  Avg time  Tests/sec
refcount                                                 1    2.423s  41265475
nested record (legacy)                                  10    2.270s     44053
nested record (value by index)                         100    2.077s    481541
nested record (value by name)                          100    2.333s    428571
nested record (value by index), matched schemas        100    2.147s    465839
nested record (value by index), resolved writer        100    2.480s    403226
nested record (value by index), resolved reader        100    2.230s    448430
simple array, matched schemas                           25    2.123s    117739
simple array, resolved writer                            1    2.747s      3641
simple array, resolved reader                           25    2.270s    110132
nested array, matched schemas                           25    3.030s     82508
nested array, resolved writer                            1    6.650s      1504
simple array, resolved reader                           25    3.313s     75453
[jira] [Created] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays
Vivek Nadkarni created AVRO-1089:
------------------------------------

             Summary: Avro-C - Penalty 30x to 50x for using resolved writer on arrays
                 Key: AVRO-1089
                 URL: https://issues.apache.org/jira/browse/AVRO-1089
             Project: Avro
          Issue Type: Bug
          Components: c
    Affects Versions: 1.6.3, 1.7.0
         Environment: Ubuntu Linux
            Reporter: Vivek Nadkarni
             Fix For: 1.7.0

The new performance tests created in AVRO-1088 show that using the resolved writer takes 30 to 50 times longer than using no schema resolution or using the resolved reader for simple and nested arrays. For a simple array, using the resolved writer took ~30x longer than using the memory reader that assumed a matching schema; for the nested array, using the resolved writer took ~50x longer. These results suggest that there is a bug in the resolved writer. I do not have a proposed fix at this time.

Test                           Tests/run  Avg time  Tests/sec
simple array, matched schemas         25    2.123s    117739
simple array, resolved writer          1    2.747s      3641
nested array, matched schemas         25    3.030s     82508
nested array, resolved writer          1    6.650s      1504
[jira] [Updated] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays
[ https://issues.apache.org/jira/browse/AVRO-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Nadkarni updated AVRO-1089:
---------------------------------

           Attachment: AVRO-1089-performance.png
    Original Estimate: 48h
   Remaining Estimate: 48h

This screenshot was generated using kcachegrind, after running the performance test test_simple_array_resolved_writer(). The plot shows that the majority of the time (97%) is spent in the function avro_resolved_writer_free_elements(), called by avro_resolved_array_writer_reset(). This information suggests that the bug lies in one of these two functions. Unfortunately, I still don't have a mechanism or a fix for this issue.
[jira] [Created] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file
Catalin Alexandru Zamfir created AVRO-1090:
----------------------------------------------

             Summary: DataFileWriter should expose sync marker to allow concurrent writes to same .avro file
                 Key: AVRO-1090
                 URL: https://issues.apache.org/jira/browse/AVRO-1090
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.6.3
            Reporter: Catalin Alexandru Zamfir

We're writing to Hadoop via DataFileWriter (FSDataOutputStream), with two threads per node on 8 nodes. Some of the nodes share the same path. For example, our TimestampedWriter class takes a path argument and appends the timestamp to it (e.g. SomePath/2012/05/14), so two threads or two nodes can access the same path. The race condition when these streams are written is resolved with a check to see whether the file has already been created by a faster thread; if so, the writer appends instead of creating the file on HDFS.

The problem is that DataFileWriter generates a random 16-byte sync marker for each instance, so two threads with two different writer instances have different sync markers. When the data is read back, this produces an IOException (Invalid sync!). There is a big performance penalty here: because only one writer can write to a given path at a time, the path becomes a bottleneck. For 1B (billion) rows, it took us 4 hours to generate the load; with 20 concurrent threads, it took only 12.5 minutes.

If DataFileWriter exposed the sync marker, a developer could read it and make sure that the next thread appending to the file uses the same sync marker. I don't know whether it's even possible to expose the sync marker so that other DataFileWriter instances can share it from the file. We have a workaround for this: making each writer a unique instance and generating a path based on that uniqueness. But instead of SomePath/2012/05/14/Shard.avro, we would then have SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that write the data in.

If it can be done, it would be a huge fix for a bottleneck problem: the single writer that can write to a single path.
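Per the Avro object container file specification, the header begins with the magic bytes 'Obj' 0x01, followed by a file-metadata map and then the 16-byte sync marker, so in principle a cooperating appender can read the marker back out of an existing file instead of generating a fresh random one. A minimal Java sketch of that header walk follows; it assumes only the published file-format layout, not any DataFileWriter API:

```java
import java.util.Arrays;

// Walks an Avro container-file header (magic, file-metadata map, sync
// marker) and returns the 16-byte sync marker. Based on the published
// file-format layout only; this is not part of DataFileWriter's API.
public class SyncMarkerReader {
    private final byte[] buf;
    private int pos;

    public SyncMarkerReader(byte[] header) { this.buf = header; }

    // Avro's zig-zag varint encoding for longs
    private long readLong() {
        long n = 0;
        int shift = 0, b;
        do {
            b = buf[pos++] & 0xff;
            n |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);
    }

    public byte[] readSyncMarker() {
        if (buf[0] != 'O' || buf[1] != 'b' || buf[2] != 'j' || buf[3] != 1)
            throw new IllegalArgumentException("not an Avro container file");
        pos = 4;
        long count;
        while ((count = readLong()) != 0) {  // metadata map, block by block
            if (count < 0) { count = -count; readLong(); } // negative count: block byte size follows
            for (long i = 0; i < count; i++) {
                pos += (int) readLong();     // skip key (string)
                pos += (int) readLong();     // skip value (bytes)
            }
        }
        return Arrays.copyOfRange(buf, pos, pos + 16); // the sync marker itself
    }
}
```

A second writer that wants to append to the same stream could then emit its blocks with this marker. Note that real files carry at least avro.schema (and usually avro.codec) in the metadata map.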
[jira] [Updated] (AVRO-1081) GenericDatumWriter does not support native ByteBuffers
[ https://issues.apache.org/jira/browse/AVRO-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Fuller updated AVRO-1081:
--------------------------------

    Attachment: patch.diff.txt
                ByteBufferTest.java

Attached one possible way of fixing this. I am not 100% happy with the solution, but it should work for our case for now. We are writing several Avro files concurrently from within a heavily multithreaded application, and cannot afford to load many of the files into memory at once at that point. When reading the files again (in a Hadoop M/R job) it's OK for now to read the bytes back into memory (we will read them into a byte array, but for the purpose of the test I wanted to put them into a byte buffer). I am also now wondering whether you might consider direct support for File (or something like FileData) in Avro. This would allow a way to include in Avro items that exceed the amount of available memory. But maybe it's not a requirement anybody else would have...

GenericDatumWriter does not support native ByteBuffers
------------------------------------------------------

                 Key: AVRO-1081
                 URL: https://issues.apache.org/jira/browse/AVRO-1081
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.6.3
            Reporter: Robert Fuller
         Attachments: ByteBufferTest.java, patch.diff.txt

An exception is thrown when trying to encode bytes backed by a file:

java.lang.UnsupportedOperationException: null
    at java.nio.ByteBuffer.arrayOffset(ByteBuffer.java:968) ~[na:1.6.0_31]
    at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:61) ~[avro-1.6.3.jar:1.6.3]

Note that arrayOffset is an optional method, see: http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html#arrayOffset%28%29
FileChannel returns a native (direct) ByteBuffer, not a HeapByteBuffer. See here: http://mail-archives.apache.org/mod_mbox/avro-user/201202.mbox/%3ccb57f421.6bfe2%25sc...@richrelevance.com%3E
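The exception arises because BinaryEncoder.writeBytes assumes a heap ByteBuffer and calls array()/arrayOffset(), both optional operations that a direct (e.g. file-backed) buffer does not support (hasArray() returns false for such buffers). A sketch of a copy-based fallback follows; this is illustrative only and is not the contents of the attached patch.diff.txt:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Sketch of a writeBytes that also handles ByteBuffers without a backing
// array (hasArray() == false), e.g. direct buffers returned by FileChannel.
// Illustrative only; not the patch attached to this issue.
public class ByteBufferWriter {
    public static void writeBytes(ByteBuffer bytes, OutputStream out) throws IOException {
        if (bytes.hasArray()) {
            // Heap buffer: the backing array can be written directly.
            out.write(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining());
        } else {
            // Direct buffer: copy through a bounded temporary array so the
            // whole payload never has to sit in the Java heap at once.
            ByteBuffer dup = bytes.duplicate(); // leave the caller's position untouched
            byte[] tmp = new byte[Math.min(dup.remaining(), 8192)];
            while (dup.hasRemaining()) {
                int n = Math.min(dup.remaining(), tmp.length);
                dup.get(tmp, 0, n);
                out.write(tmp, 0, n);
            }
        }
    }
}
```

The bounded copy loop is what keeps the memory footprint small for the multithreaded-writer case described above: only one 8 KB chunk per writer is resident at a time, regardless of the payload size.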
[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file
[ https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13274584#comment-13274584 ]

Catalin Alexandru Zamfir commented on AVRO-1090:
------------------------------------------------

From the source code of DataFileWriter, the appendTo method seems to do what is intended, but it only accepts a (File) argument. When a writer connects over the network to Hadoop and needs to write to an FSDataOutputStream instead of a file, the advantages of the appendTo method cannot be used. So it does seem possible to retrieve the sync marker from an existing .avro file and continue writing with the same marker. Can that be done here as well? Can an appendTo(FSDataOutputStream) method be created? This would allow concurrent writers to create or append to the same output stream using the same marker, so that the data can be read back.
[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file
[ https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13274592#comment-13274592 ]

Catalin Alexandru Zamfir commented on AVRO-1090:
------------------------------------------------

Also, we're doing objRecordWriter.create(getHdfs().append(objPath)), which should make a DataFileWriter on the FSDataOutputStream that respects the first sync marker of the written file. So if thread #1 has already created the file, thread #2 can now append to the given path. But because it appends, it does not need to generate a new sync marker; instead, it could read the sync marker from the already-written file and use it as its own. This does not happen. We get an Invalid sync because we are creating multiple writers in different threads: even if one thread finishes first and creates the given path, the next thread that appends to it does not take into account that it should first read the existing sync marker from the file. DataFileWriter should take into account that a file/path/stream it has already written contains a sync marker, and that there is no need to generate another one; it should read the existing sync marker and use it to append the data to the given HDFS path. Thanks for your patience.
[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation
[ https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thiruvalluvan M. G. updated AVRO-1079:
--------------------------------------

    Assignee: Thiruvalluvan M. G.
      Status: Patch Available  (was: Open)

C++ Generator, improve include guard generation
-----------------------------------------------

                  Key: AVRO-1079
                  URL: https://issues.apache.org/jira/browse/AVRO-1079
              Project: Avro
           Issue Type: Improvement
           Components: c++
     Affects Versions: 1.6.3, 1.7.0
             Reporter: falk.trist...@cae.de
             Assignee: Thiruvalluvan M. G.
             Priority: Minor
          Attachments: AVRO-1079.patch
    Original Estimate: 5h
   Remaining Estimate: 5h

We integrated Avro into our CMake build system, so we have a JSON-to-C++-header build step quite similar to the Qt moc compiler. This build step overwrites the target header files when the JSON schema file has been modified or when the generated header files differ. In addition, we want to put the generated header files under version control. However, the generated include guards contain a random part. This random part troubles the version control system, and it also leads to unnecessary rebuilds, because CMake 'thinks' important files have changed and triggers a rebuild of the corresponding dependencies.

Suggestion: add a command line parameter to either switch off the random part of the string, or to take a string from the command line and use it as the include guard.
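As a sketch of the deterministic half of that suggestion, an include guard can be derived purely from the output header file name, so regenerating an unchanged schema yields a byte-identical file. guardFor() below is a hypothetical helper, not avrogencpp's actual code:

```java
// Hypothetical sketch of deterministic include-guard generation: derive
// the guard from the output header file name instead of appending a
// random suffix, so regenerating an unchanged schema yields an
// identical file. Not the generator's actual implementation.
public class IncludeGuard {
    public static String guardFor(String fileName) {
        StringBuilder sb = new StringBuilder();
        for (char c : fileName.toCharArray()) {
            // map anything that is not a valid identifier character to '_'
            sb.append(Character.isLetterOrDigit(c) ? Character.toUpperCase(c) : '_');
        }
        return sb.append('_').toString(); // e.g. "cpx.hh" -> "CPX_HH_"
    }

    public static void main(String[] args) {
        System.out.println("#ifndef " + guardFor("my_schema.hh"));
        System.out.println("#define " + guardFor("my_schema.hh"));
    }
}
```

A name-derived guard is stable across runs, which is exactly what version control and CMake's dependency checking need; the trade-off is that two headers with the same file name in different directories would collide, which the random suffix avoided.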
[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation
[ https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thiruvalluvan M. G. updated AVRO-1079:
--------------------------------------

    Attachment: AVRO-1079.patch

This patch tries to reuse the include guard if the output file already exists.
[jira] [Updated] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file
[ https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Catalin Alexandru Zamfir updated AVRO-1090: --- Description: We're writing to Hadoop via DataFileWriter (FSDataOutputStream). We're doing this with two threads per node, on 8 nodes. Some of the nodes share the same path. For example, our: TimestampedWriter class, takes a path argument and appends the timestamp to it (ex: SomePath/2012/05/14). Thus, two threads or two nodes can access the same path. The race condition when these streams are written, is resolved with a check to see if the file exists (has been created) by a faster thread. If that's so, it appends, instead of creating the file on the HDFS. The problem is that DataFileWriter, generates a 16-byte, random string for each instance. So, two threads with 2 different writer instances, have a different sync marker. That means that data, when trying to read it back, will get an IOException (Invalid sync!). There's a big performance penalty here. Because only one writer can write at once to one given path, it becomes a bottleneck. For 1B (billion) rows, it took us 4 hours to generate load. With 20 concurrent threads, it took only 12.5 minutes. If DataFileWriter would expose the sync marker, a developer could read that and make sure that the next thread that appends to the file, uses the same sync marker. Don't know if it's even possible to expose the sync marker so as other instances of DataFileWriter can share the sync marker, from the file. We have a fix for this, making sure each writer is an unique instance and generating a path based on that uniqueness. But instead of having SomePath/2012/05/14/Shard.avro we'd now have SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that write the data in. If it can be done, it would be a huge fix for a bottleneck problem. The bottleneck being the single writer that can write to a single path. 
This has also been requested on the avro-user thread: http://grokbase.com/t/avro/user/122m4sjm1y/is-it-possible-to-append-to-an-already-existing-avro-file I just could not find the JIRA ticket for this request.
DataFileWriter should expose sync marker to allow concurrent writes to same .avro file Key: AVRO-1090 URL: https://issues.apache.org/jira/browse/AVRO-1090 Project: Avro Issue Type: Bug Affects Versions: 1.6.3 Reporter: Catalin Alexandru Zamfir
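The per-writer shard-path workaround described in the issue can be sketched as follows. This is only an illustration of the reporter's approach, not code from Avro; `uniqueShardPath` and the `Shard-` prefix are hypothetical names:

```java
import java.util.UUID;

public class ShardPaths {
    // Build a unique .avro path per writer instance, so that no two
    // DataFileWriter instances (each carrying its own random 16-byte
    // sync marker) ever append to the same file.
    public static String uniqueShardPath(String basePath) {
        return basePath + "/Shard-" + UUID.randomUUID() + ".avro";
    }

    public static void main(String[] args) {
        String p1 = uniqueShardPath("SomePath/2012/05/14");
        String p2 = uniqueShardPath("SomePath/2012/05/14");
        // Two writers get two distinct files instead of racing on one.
        System.out.println(p1.equals(p2)); // prints "false"
    }
}
```

The trade-off the reporter notes: this avoids the "Invalid sync!" IOException by never mixing markers in one file, at the cost of scattering one logical partition across many small shard files.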
[jira] [Commented] (AVRO-1085) Fingerprinting for C#
[ https://issues.apache.org/jira/browse/AVRO-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274743#comment-13274743 ] Thiruvalluvan M. G. commented on AVRO-1085: --- +1. Looks good to me. Built and tested on Windows 7, Visual C# Express 2010. Fingerprinting for C# - Key: AVRO-1085 URL: https://issues.apache.org/jira/browse/AVRO-1085 Project: Avro Issue Type: New Feature Components: csharp Affects Versions: 1.7.0 Reporter: Eric Hauser Fix For: 1.7.0 Attachments: AVRO-1085.patch Avro fingerprinting for C#. This is a direct port of the Java SchemaNormalization class and tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
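For context on what the port implements: the Avro specification defines a 64-bit schema fingerprint as a CRC-64 variant whose "empty" value 0xC15D213AA4D7A795 doubles as the bit-reversed polynomial, applied to the bytes of the schema's Parsing Canonical Form. A standalone Java sketch of that core (the canonicalization step done by the real SchemaNormalization class is omitted here; the `"\"int\""` input is simply an example of a canonical-form string):

```java
public class Crc64Avro {
    // CRC-64-AVRO as given in the Avro specification's fingerprint
    // pseudo-code: EMPTY is both the initial value and, bit-reversed,
    // the polynomial used to build the lookup table.
    private static final long EMPTY = 0xc15d213aa4d7a795L;
    private static final long[] TABLE = new long[256];

    static {
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++) {
                // Shift right; XOR in the polynomial iff the low bit was set.
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            }
            TABLE[i] = fp;
        }
    }

    // Fingerprint the bytes of a schema's Parsing Canonical Form.
    public static long fingerprint64(byte[] buf) {
        long fp = EMPTY;
        for (byte b : buf) {
            fp = (fp >>> 1) ^ TABLE[(int) (fp ^ b) & 0xff];
        }
        return fp;
    }

    public static void main(String[] args) {
        byte[] canonical = "\"int\"".getBytes();
        // The fingerprint is deterministic: same bytes, same 64-bit value.
        System.out.println(fingerprint64(canonical) == fingerprint64(canonical)); // prints "true"
    }
}
```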