[jira] [Created] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)
Vivek Nadkarni created AVRO-1088:


 Summary: Avro-C - Add performance tests for schema resolution and 
arrays.
 Key: AVRO-1088
 URL: https://issues.apache.org/jira/browse/AVRO-1088
 Project: Avro
  Issue Type: Improvement
  Components: c
Affects Versions: 1.7.0
 Environment: Ubuntu Linux 11.10
Reporter: Vivek Nadkarni
 Fix For: 1.7.0


The current performance test in Avro-C measures the performance of
reading and writing Avro values using a complex record schema that
does not contain any arrays.

We add tests to measure the performance for simple and nested
arrays. We also replicate all tests to measure the performance of the
schema resolution using a resolved reader and a resolved writer.

Specifically we add the following performance tests:

Nested Record
1. Replicating the existing "nested record (value by index)" test using
   a helper function. Helper functions add a little overhead, but they
   let us test various schemas, as well as different modes of schema
   resolution, much more easily.
2. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a complex record.
3. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a complex record.

Simple Array
4. Test the performance for reading and writing a simple array.
5. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a simple array.
6. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a simple array.

Nested Array
7. Test the performance for reading and writing a nested array.
8. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a nested array.
9. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a nested array.

Additionally we fix a minor bug:
1. The return value of avro_value_equal_fast() was not being
   tested. Test this return value, and fail if it is FALSE.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Nadkarni updated AVRO-1088:
-

Attachment: AVRO-1088.patch

Uploading patch file implementing the new performance tests. 


 Avro-C - Add performance tests for schema resolution and arrays.
 

 Key: AVRO-1088
 URL: https://issues.apache.org/jira/browse/AVRO-1088
 Project: Avro
  Issue Type: Improvement
  Components: c
Affects Versions: 1.7.0
 Environment: Ubuntu Linux 11.10
Reporter: Vivek Nadkarni
 Fix For: 1.7.0

 Attachments: AVRO-1088.patch

   Original Estimate: 24h
  Remaining Estimate: 24h






[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Nadkarni updated AVRO-1088:
-

Status: Patch Available  (was: Open)

I ran the performance tests and got the results appended below.

The results show that, as expected, there is a slight performance hit
for using a resolved writer or resolved reader for the complex record,
compared to using the matched schemas.

However, the results also show that for the simple array and for the
nested array, the penalty for using the resolved writer is
substantial. Using the resolved writer takes 30 to 50 times longer
than using no schema resolution or using the resolved reader for
simple and nested arrays.

The performance results indicate that there is a likely bug in the
resolved writer, when it is trying to resolve simple or nested
arrays. This bug will be reported in a separate AVRO-JIRA issue.
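
As a quick check of the 30x to 50x figure, the slowdown can be computed from the Tests/sec values reported in this comment. This is a small standalone Java sketch, not part of the Avro-C test suite; the class and method names are made up for illustration.

```java
public class SlowdownCheck {
    // Ratio of baseline throughput to measured throughput (both in tests/sec).
    static double slowdown(double baselineTestsPerSec, double measuredTestsPerSec) {
        return baselineTestsPerSec / measuredTestsPerSec;
    }

    public static void main(String[] args) {
        // Tests/sec values copied from the results in this comment:
        // matched schemas versus resolved writer.
        System.out.printf("simple array: %.1fx%n", slowdown(117739, 3641));
        System.out.printf("nested array: %.1fx%n", slowdown(82508, 1504));
    }
}
```

The two ratios come out at roughly 32x for the simple array and roughly 55x for the nested array, which matches the range quoted above.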


 Running refcount
  100000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.423s
  Tests/sec:41265475
 Running nested record (legacy)
  100000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.270s
  Tests/sec:44053
 Running nested record (value by index)
  1000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.077s
  Tests/sec:481541
 Running nested record (value by name)
  1000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.333s
  Tests/sec:428571
 Running nested record (value by index) matched schemas
  1000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.147s
  Tests/sec:465839
 Running nested record (value by index) resolved writer
  1000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.480s
  Tests/sec:403226
 Running nested record (value by index) resolved reader
  1000000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.230s
  Tests/sec:448430
 Running simple array matched schemas
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.123s
  Tests/sec:117739
 Running simple array resolved writer
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.747s
  Tests/sec:3641
 Running simple array resolved reader
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.270s
  Tests/sec:110132
 Running nested array matched schemas
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 3.030s
  Tests/sec:82508
 Running nested array resolved writer
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 6.650s
  Tests/sec:1504
 Running simple array resolved reader
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 3.313s
  Tests/sec:75453




[jira] [Created] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays

2012-05-14 Thread Vivek Nadkarni (JIRA)
Vivek Nadkarni created AVRO-1089:


 Summary: Avro-C - Penalty 30x to 50x for using resolved writer on 
arrays
 Key: AVRO-1089
 URL: https://issues.apache.org/jira/browse/AVRO-1089
 Project: Avro
  Issue Type: Bug
  Components: c
Affects Versions: 1.6.3, 1.7.0
 Environment: Ubuntu Linux
Reporter: Vivek Nadkarni
 Fix For: 1.7.0


The new performance tests created in AVRO-1088 show that using the
resolved writer takes 30 to 50 times longer than using no schema
resolution or using the resolved reader for simple and nested arrays.

For a simple array, using the resolved writer took ~30x longer than
using the memory reader that assumed a matching schema. For the nested
array, using the resolved writer took ~50x longer.

These results suggest that there is a bug in the resolved writer. I do not
have a proposed fix at this time.


 Running simple array matched schemas
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.123s
  Tests/sec:117739
 Running simple array resolved writer
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.747s
  Tests/sec:3641


 Running nested array matched schemas
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 3.030s
  Tests/sec:82508
 Running nested array resolved writer
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 6.650s
  Tests/sec:1504







[jira] [Updated] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays

2012-05-14 Thread Vivek Nadkarni (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Nadkarni updated AVRO-1089:
-

Attachment: AVRO-1089-performance.png

This screenshot was generated using kcachegrind, after running the
performance test test_simple_array_resolved_writer(). The plot shows
that the majority of the time (97%) is spent in the function
avro_resolved_writer_free_elements() called by
avro_resolved_array_writer_reset(). This information suggests that the
bug lies in one of these two functions. Unfortunately, I still don't
have a mechanism or a fix for this issue. 



 Avro-C - Penalty 30x to 50x for using resolved writer on arrays
 ---

 Key: AVRO-1089
 URL: https://issues.apache.org/jira/browse/AVRO-1089
 Project: Avro
  Issue Type: Bug
  Components: c
Affects Versions: 1.6.3, 1.7.0
 Environment: Ubuntu Linux
Reporter: Vivek Nadkarni
 Fix For: 1.7.0

 Attachments: AVRO-1089-performance.png

   Original Estimate: 48h
  Remaining Estimate: 48h






[jira] [Created] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)
Catalin Alexandru Zamfir created AVRO-1090:
--

 Summary: DataFileWriter should expose sync marker to allow 
concurrent writes to same .avro file
 Key: AVRO-1090
 URL: https://issues.apache.org/jira/browse/AVRO-1090
 Project: Avro
  Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir


We're writing to Hadoop via DataFileWriter (FSDataOutputStream). We're doing 
this with two threads per node, on 8 nodes. Some of the nodes share the same 
path. For example, our TimestampedWriter class takes a path argument and 
appends the timestamp to it (e.g. SomePath/2012/05/14). Thus, two threads or two 
nodes can access the same path. The race condition when these streams are 
written is resolved with a check to see whether the file already exists (has 
been created by a faster thread). If so, the writer appends instead of creating 
the file on HDFS.

The problem is that DataFileWriter generates a random 16-byte sync marker for 
each instance. So two threads with two different writer instances have 
different sync markers. That means that reading the data back fails with an 
IOException (Invalid sync!).
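
The failure mode can be illustrated without Avro at all. The following Java sketch is a simplified model, not DataFileWriter's actual code: each hypothetical writer draws its own random 16-byte marker, and a reader that expects the marker stored in the file header rejects a block terminated by a different one. All class and method names here are assumptions for illustration.

```java
import java.security.SecureRandom;
import java.util.Arrays;

public class SyncMarkerDemo {
    static final int SYNC_SIZE = 16; // the 16-byte marker size mentioned above

    // Model of a writer instance generating its own random marker.
    static byte[] newSyncMarker() {
        byte[] marker = new byte[SYNC_SIZE];
        new SecureRandom().nextBytes(marker);
        return marker;
    }

    // Model of the reader's check: the marker that follows a block
    // must equal the marker stored in the file header.
    static boolean blockIsValid(byte[] headerMarker, byte[] blockMarker) {
        return Arrays.equals(headerMarker, blockMarker);
    }

    public static void main(String[] args) {
        byte[] writer1 = newSyncMarker(); // created the file; its marker is in the header
        byte[] writer2 = newSyncMarker(); // appended later with a fresh marker
        System.out.println("writer1 block valid: " + blockIsValid(writer1, writer1));
        System.out.println("writer2 block valid: " + blockIsValid(writer1, writer2));
    }
}
```

In this model the blocks appended by the second writer are the ones a reader would reject.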

There's a big performance penalty here. Because only one writer can write to 
a given path at a time, it becomes a bottleneck. For 1B (billion) rows, it took 
us 4 hours to generate & load. With 20 concurrent threads, it took only 12.5 
minutes.

If DataFileWriter exposed the sync marker, a developer could read it and make 
sure that the next thread that appends to the file uses the same sync marker. 
I don't know whether it's even possible to expose the sync marker so that other 
instances of DataFileWriter can share the marker from the file. We have a 
workaround for this: making sure each writer is a unique instance and generating 
a path based on that uniqueness. But instead of having 
SomePath/2012/05/14/Shard.avro we now have 
SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that 
write the data.
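
A minimal sketch of the per-writer path scheme described above, assuming a helper like the hypothetical `shardPath` below. It is not the reporter's actual code, only an illustration of how a random UUID keeps writers from sharing a file.

```java
import java.util.UUID;

public class ShardPaths {
    // Builds a path such as SomePath/2012/05/14/Shard-<random UUID>.avro
    // so that every writer instance gets a file of its own.
    static String shardPath(String datedDir) {
        return datedDir + "/Shard-" + UUID.randomUUID() + ".avro";
    }

    public static void main(String[] args) {
        // Two writers targeting the same dated directory get distinct files.
        System.out.println(shardPath("SomePath/2012/05/14"));
        System.out.println(shardPath("SomePath/2012/05/14"));
    }
}
```

The trade-off is the one noted above: readers then have to deal with many shard files per directory instead of a single Shard.avro.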

If it can be done, it would be a huge fix for a bottleneck problem, the 
bottleneck being that only a single writer can write to a single path.





[jira] [Updated] (AVRO-1081) GenericDatumWriter does not support native ByteBuffers

2012-05-14 Thread Robert Fuller (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Fuller updated AVRO-1081:


Attachment: patch.diff.txt
ByteBufferTest.java

Attached one possible way of fixing this.

I am not 100% happy with the solution, but it should work for our case for now. 
We are writing several Avro files concurrently from within a heavily 
multithreaded application, and cannot afford to load many of the files into 
memory at once at that point.

When reading the files again (in a Hadoop M/R job) it's OK for now to read the 
bytes back into memory (we will read them into a byte array, but for the purpose 
of the test I wanted to put them into a byte buffer).

I am also now wondering whether you might consider direct support for File (or 
something like FileData) in Avro. This would allow including items in Avro 
that exceed the amount of available memory. But maybe it's not a requirement 
anybody else would have...

 GenericDatumWriter does not support native ByteBuffers
 --

 Key: AVRO-1081
 URL: https://issues.apache.org/jira/browse/AVRO-1081
 Project: Avro
  Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Robert Fuller
 Attachments: ByteBufferTest.java, ByteBufferTest.java, 
 patch.diff.txt, patch.diff.txt


 An exception is thrown when trying to encode bytes backed by a file:
 java.lang.UnsupportedOperationException: null
   at java.nio.ByteBuffer.arrayOffset(ByteBuffer.java:968) ~[na:1.6.0_31]
   at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:61) ~[avro-1.6.3.jar:1.6.3]
 Note that arrayOffset is an optional operation, see:
 http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html#arrayOffset%28%29
 FileChannel returns a native (direct) ByteBuffer, not a HeapByteBuffer.
 See here:
 http://mail-archives.apache.org/mod_mbox/avro-user/201202.mbox/%3ccb57f421.6bfe2%25sc...@richrelevance.com%3E
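
One way to handle both kinds of buffer is to check hasArray() before touching the backing array. The sketch below is only an illustration using java.nio; it is not the attached patch, and the `toBytes` helper and class name are hypothetical.

```java
import java.nio.ByteBuffer;

public class ByteBufferBytes {
    // Copies the remaining bytes of any ByteBuffer without assuming a backing array.
    // The caller's buffer position is left unchanged in both branches.
    static byte[] toBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        if (buf.hasArray()) {
            // Heap buffer: array() and arrayOffset() are supported.
            System.arraycopy(buf.array(), buf.arrayOffset() + buf.position(), out, 0, out.length);
        } else {
            // Direct or mapped buffer: arrayOffset() would throw
            // UnsupportedOperationException, so read through a duplicate instead.
            buf.duplicate().get(out);
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(4);
        direct.put(new byte[] {1, 2, 3, 4});
        direct.flip();
        System.out.println(toBytes(direct).length);
    }
}
```

Copying through a duplicate costs an extra array for direct buffers, but it avoids calling an optional operation that the buffer may not support.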





[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274584#comment-13274584
 ] 

Catalin Alexandru Zamfir commented on AVRO-1090:


From the source code of DataFileWriter, the appendTo method seems to do what 
is intended, but it only accepts a File argument. In the case where a writer 
connects over the network to Hadoop and needs to write to an 
FSDataOutputStream instead of a file, the appendTo method cannot be used. 
Still, it shows that it is possible to retrieve the sync marker from an 
existing .avro file and continue writing with the same marker.

Can this be done here also? Can an appendTo(FSDataOutputStream) method be 
created? This would allow concurrent writers to create or append to the same 
output stream using the same marker, thus enabling the data to be read back.






[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274592#comment-13274592
 ] 

Catalin Alexandru Zamfir commented on AVRO-1090:


Also, we're doing objRecordWriter.create (getHdfs ().append (objPath)), which 
should create a DataFileWriter on the FSDataOutputStream that respects the 
first sync marker of the written file. So if thread #1 has already created the 
file, thread #2 can now append to the given path. But because it appends, it 
does not need to call generateSync to create a new sync marker. Instead, it can 
read the sync marker from the already generated file and use it as its own 
sync marker.

This does not happen. We get an Invalid Sync because we are creating multiple 
writers in different threads: even if one thread finishes first and creates the 
given path, the next thread that appends to it does not take into account that 
it should first read the existing sync marker stored in the file. 
DataFileWriter should take into account that a file/path/stream written by it 
already contains a sync marker, and that there's no need to generate another 
one. It should get the existing sync marker and use that to append the data to 
the given HDFS path.
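
To make "read the existing sync marker" concrete, here is a rough Java sketch that pulls the marker out of the start of an Avro data file held in a ByteBuffer. It assumes the header layout from the Avro object container file specification (4-byte magic, metadata map, 16-byte sync marker); it does not use DataFileWriter, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SyncMarkerReader {
    // Decodes one Avro long (variable-length, zig-zag encoded).
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Skips a length-prefixed string or bytes value.
    static void skipBytes(ByteBuffer in) {
        int length = (int) readLong(in);
        in.position(in.position() + length);
    }

    // Returns the 16-byte sync marker that follows the magic and the metadata map.
    static byte[] readSyncMarker(ByteBuffer header) {
        byte[] magic = new byte[4];
        header.get(magic);
        if (!Arrays.equals(magic, new byte[] {'O', 'b', 'j', 1})) {
            throw new IllegalArgumentException("not an Avro data file");
        }
        for (long count = readLong(header); count != 0; count = readLong(header)) {
            if (count < 0) {
                count = -count;    // a negative count is followed by the block's byte size
                readLong(header);
            }
            for (long i = 0; i < count; i++) {
                skipBytes(header); // metadata key
                skipBytes(header); // metadata value
            }
        }
        byte[] sync = new byte[16];
        header.get(sync);
        return sync;
    }

    public static void main(String[] args) {
        // Smallest possible header: magic, empty metadata map, 16 marker bytes.
        ByteBuffer header = ByteBuffer.allocate(4 + 1 + 16);
        header.put(new byte[] {'O', 'b', 'j', 1}).put((byte) 0);
        for (int i = 0; i < 16; i++) header.put((byte) i);
        header.flip();
        System.out.println(Arrays.toString(readSyncMarker(header)));
    }
}
```

An appending writer that obtained the marker this way could terminate its blocks with the same 16 bytes, which is the behaviour requested above.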

Thanks for your patience.






[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-1079:
--

Assignee: Thiruvalluvan M. G.
  Status: Patch Available  (was: Open)

 C++ Generator, improve include guard generation
 ---

 Key: AVRO-1079
 URL: https://issues.apache.org/jira/browse/AVRO-1079
 Project: Avro
  Issue Type: Improvement
  Components: c++
Affects Versions: 1.6.3, 1.7.0
Reporter: falk.trist...@cae.de
Assignee: Thiruvalluvan M. G.
Priority: Minor
 Attachments: AVRO-1079.patch

   Original Estimate: 5h
  Remaining Estimate: 5h

 We integrated Avro into our CMake build system, so we have a JSON -> C++ header 
 build step which is quite similar to the Qt moc compiler. This build step 
 overwrites the target header files if the JSON schema file was modified 
 or when the generated header files are different.
 In addition, we want to put the generated header files under version 
 control.
 However, the generated include guards contain a random part. This random 
 part troubles the version control system. It also leads to unnecessary 
 rebuilds, as CMake 'thinks' important files have been changed and triggers a 
 rebuild of the corresponding dependencies.
 Suggestion: 
 Add an additional command line parameter to either switch off the random part 
 of the string, or to take a string from the command line and use it as the 
 include guard.
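
A rough illustration of what "switching off the random part" could look like: derive the guard from the output file name alone, so regenerating the same header always yields the same macro. The sketch is in Java for brevity (the real generator is C++), and both the naming scheme and the `GUARD_` prefix are assumptions, not what avrogencpp does.

```java
import java.util.Locale;

public class IncludeGuard {
    // Upper-cases the file name and maps every other character to '_', so the
    // same output file always yields the same guard. The "GUARD_" prefix is an
    // arbitrary choice that keeps the macro from starting with a digit.
    static String fromFileName(String headerFileName) {
        String upper = headerFileName.toUpperCase(Locale.ROOT);
        return "GUARD_" + upper.replaceAll("[^A-Z0-9]", "_");
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("my-schema.hh"));
    }
}
```

A guard built this way is stable across runs, so neither the version control system nor CMake sees a change unless the schema itself changes.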





[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-1079:
--

Attachment: AVRO-1079.patch

This patch tries to reuse the include guard if the output file already exists.
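
The idea of reusing an existing guard can be sketched as follows. This is not the content of AVRO-1079.patch (which targets the C++ generator); it is a hypothetical Java illustration that looks for the first `#ifndef` in the old header text and falls back to a freshly generated guard when none is found.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardReuse {
    private static final Pattern IFNDEF =
            Pattern.compile("^\\s*#ifndef\\s+(\\w+)", Pattern.MULTILINE);

    // Returns the macro named by the first #ifndef in an existing header, if any.
    static Optional<String> existingGuard(String headerText) {
        Matcher m = IFNDEF.matcher(headerText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Reuses the old guard when the output file already has one;
    // otherwise uses the freshly generated guard. Pass null when no file exists.
    static String chooseGuard(String oldHeaderText, String freshGuard) {
        if (oldHeaderText == null) {
            return freshGuard;
        }
        return existingGuard(oldHeaderText).orElse(freshGuard);
    }

    public static void main(String[] args) {
        String old = "#ifndef FOO_HH_123456__H_\n#define FOO_HH_123456__H_\n#endif\n";
        System.out.println(chooseGuard(old, "FOO_HH_999999__H_"));
        System.out.println(chooseGuard(null, "FOO_HH_999999__H_"));
    }
}
```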






[jira] [Updated] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catalin Alexandru Zamfir updated AVRO-1090:
---

Description: 
We're writing to Hadoop via DataFileWriter (FSDataOutputStream), with two 
threads per node on 8 nodes. Some of the nodes share the same path. For 
example, our TimestampedWriter class takes a path argument and appends the 
timestamp to it (e.g. SomePath/2012/05/14). Thus, two threads or two nodes can 
access the same path. The race condition when these streams are written is 
resolved with a check to see if the file exists (has been created by a faster 
thread). If so, the writer appends instead of creating the file on HDFS.

The problem is that DataFileWriter generates a random 16-byte string (the sync 
marker) for each instance, so two threads with two different writer instances 
have different sync markers. As a result, reading the data back fails with an 
IOException (Invalid sync!).
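
The failure mode can be reproduced with a toy container format. The sketch
below is not Avro's file format or its DataFileWriter: `ToyWriter`,
`read_all` and `demo` are hypothetical names, and the layout (a header
holding only the marker, then length-prefixed blocks each followed by the
marker) is a deliberate simplification. It only shows why a block written
under a different marker is rejected, and that reading succeeds when a second
writer is handed the first writer's marker.

```python
import io
import os
import struct

SYNC_SIZE = 16  # the issue describes a 16-byte random marker per writer


class ToyWriter:
    """Hypothetical, simplified writer; not Avro's DataFileWriter."""

    def __init__(self, stream, sync=None, write_header=True):
        self.stream = stream
        # Each instance picks its own random marker unless one is supplied.
        self.sync = sync if sync is not None else os.urandom(SYNC_SIZE)
        if write_header:
            stream.write(self.sync)  # the header records the file's marker

    def append(self, payload):
        # Block layout in this toy format: length, payload, sync marker.
        self.stream.write(struct.pack(">I", len(payload)))
        self.stream.write(payload)
        self.stream.write(self.sync)


def read_all(data):
    """Return all payloads, checking every block against the header marker."""
    stream = io.BytesIO(data)
    header_sync = stream.read(SYNC_SIZE)
    payloads = []
    while True:
        raw_length = stream.read(4)
        if not raw_length:
            return payloads
        (length,) = struct.unpack(">I", raw_length)
        payload = stream.read(length)
        if stream.read(SYNC_SIZE) != header_sync:
            raise IOError("Invalid sync!")
        payloads.append(payload)


def demo(share_marker):
    """Write one block each from two writers, then read the result back."""
    buf = io.BytesIO()
    first = ToyWriter(buf)
    first.append(b"row-1")
    # The second writer appends to the same stream without a new header.
    second = ToyWriter(buf, sync=first.sync if share_marker else None,
                       write_header=False)
    second.append(b"row-2")
    return read_all(buf.getvalue())
```

Calling `demo(False)` raises the error, because the second block ends with a
marker the header does not know; `demo(True)` returns both rows.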

There's a big performance penalty here. Because only one writer at a time can 
write to a given path, it becomes a bottleneck. For 1B (billion) rows, it took 
us 4 hours to generate and load. With 20 concurrent threads, it took only 12.5 
minutes. 

If DataFileWriter exposed the sync marker, a developer could read it and make 
sure that the next thread that appends to the file uses the same sync marker. 
We don't know whether it's even possible to expose the sync marker so that 
other instances of DataFileWriter can share the marker from the file. We have 
a workaround for this: making sure each writer is a unique instance and 
generating a path based on that uniqueness. But instead of having 
SomePath/2012/05/14/Shard.avro, we'd now have 
SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that 
write the data.
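
The per-writer path scheme described above might look like the following
sketch. `shard_path` and its `writer_id` parameter are hypothetical names for
illustration, not part of the reporter's TimestampedWriter class.

```python
import uuid


def shard_path(base_dir, writer_id=None):
    """Build a per-writer shard path so that no two writers share a file."""
    # writer_id is a hypothetical parameter; a random UUID is the default.
    suffix = writer_id if writer_id is not None else str(uuid.uuid4())
    return "{}/Shard-{}.avro".format(base_dir, suffix)


if __name__ == "__main__":
    print(shard_path("SomePath/2012/05/14"))
```

Each writer then owns its file and its sync marker, at the cost of many
small shard files per directory instead of a single Shard.avro.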

If it can be done, it would be a huge fix for a bottleneck problem, the 
bottleneck being the single writer that can write to a single path.

This has also been requested on the avro-user thread: 
http://grokbase.com/t/avro/user/122m4sjm1y/is-it-possible-to-append-to-an-already-existing-avro-file
I just could not find the JIRA ticket for this request.



 DataFileWriter should expose sync marker to allow concurrent writes to same 
 .avro file
 

 Key: AVRO-1090
 URL: https://issues.apache.org/jira/browse/AVRO-1090
 Project: Avro
  Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir

 We're writing to Hadoop via DataFileWriter (FSDataOutputStream), with two 
 threads per node on 8 nodes. Some of the nodes share the same path. For 
 example, our TimestampedWriter class takes a path argument and appends the 
 timestamp to it (e.g. SomePath/2012/05/14). Thus, two threads or two nodes 
 can access the same path. The race condition when these streams are written 
 is resolved with a check to see if the file exists (has been created by a 
 faster thread). If that's so, 

[jira] [Commented] (AVRO-1085) Fingerprinting for C#

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274743#comment-13274743
 ] 

Thiruvalluvan M. G. commented on AVRO-1085:
---

+1. Looks good to me.

Built and tested on Windows 7, Visual C# Express 2010.

 Fingerprinting for C#
 -

 Key: AVRO-1085
 URL: https://issues.apache.org/jira/browse/AVRO-1085
 Project: Avro
  Issue Type: New Feature
  Components: csharp
Affects Versions: 1.7.0
Reporter: Eric Hauser
 Fix For: 1.7.0

 Attachments: AVRO-1085.patch


 Avro fingerprinting for C#.  This is a direct port of the Java 
 SchemaNormalization class and tests.
