As a side comment, in previous benchmarking I've done on other systems, I've found that using memory-mapped I/O (part of Java NIO) can make a big difference.
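For reference, a minimal sketch of what reading a file through a memory mapping looks like with plain Java NIO. The class name `MappedRead` and the helper `readAll` are made up for illustration; this is not UIMA code, and whether it helps in a given setup would have to be measured.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class MappedRead {

    // Reads a whole file through a read-only memory mapping.
    // A single mapping is limited to Integer.MAX_VALUE bytes, so
    // larger files would have to be mapped in several regions.
    static byte[] readAll(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("File too large for a single mapping: " + size);
            }
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            byte[] out = new byte[(int) size];
            buffer.get(out); // copy from the mapped region into the heap array
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-demo", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "hello mapped world".getBytes(StandardCharsets.UTF_8));
        byte[] data = readAll(tmp);
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

A real deserializer would more likely read directly from the mapped buffer rather than copy it into a byte array first; the copy is only there to keep the sketch short.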
Also, when we put in gzip we expected it to speed things up, but it actually slowed things down quite a bit.

-Marshall

On 8/15/2012 4:09 AM, Richard Eckart de Castilho wrote:
> Hi,
>
> I am looking for a way to improve loading times in an application, so I did a
> little experiment with binary CAS serialization to see if it was superior to
> XMI serialization. For serialization I used the CASCompleteSerializer to
> serialize the type-system and heaps into the same file using Java object
> serialization - at least that is what I understood it should do. To read in
> these files, I would deserialize the CASCompleteSerializer and initialize a
> CAS from it using CASImpl.reinit().
>
> 96.400 files
>
> plain text (uncompressed)      :              581.865.593 Byte
> binary (serialized java, gzip) : 0:47:02.835  3.555.449.597 Byte
> xmi (gzip)                     : 1:20:31.535  4.712.633.769 Byte
>
> So binary takes about 60% of the time xmi serialization would need and uses
> about 75% of the space.
> I didn't do a reading experiment yet, but I suppose the improvement should be
> on a similar level, if not better.
>
> I am also not sure yet about the draw-backs of binary serialization and in
> which scenarios they apply. The draw-backs I saw so far are:
>
> - Type-system is stored redundantly in every output file.
> - The type system configured with CASImpl.reinit() may be different from the
>   one which was used to initialize the pipeline, CAS-based annotators relying
>   on typeSystemInit() may not be configured with the correct types - this is a
>   hypothesis I didn't test.
> - Serialized Java objects may become incompatible due to refactoring within
>   the UIMA framework. However, there is yet another binary CAS serialization
>   in UIMA which uses the DataOutputStream and may be more stable.
>
> Did anybody ever use any form of binary CAS serialization outside
> Vinci/UIMA-AS?
>
> Cheers,
>
> -- Richard
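To check the gzip question for a particular data set, something along these lines could be used. It is only a sketch with standard JDK classes: the class `SerializedRoundTrip` and its methods are hypothetical, and the `int[]` payload is a stand-in for whatever Serializable object is actually written (in the quoted experiment, the CASCompleteSerializer). A single timed run like this is not a proper benchmark.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class SerializedRoundTrip {

    // Writes obj with Java object serialization, optionally through gzip.
    static byte[] write(Serializable obj, boolean gzip) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(obj);
        } // closing the object stream also finishes the gzip stream
        return bytes.toByteArray();
    }

    // Reads back what write() produced with the same gzip setting.
    static Object read(byte[] data, boolean gzip) throws IOException, ClassNotFoundException {
        InputStream source = new ByteArrayInputStream(data);
        if (gzip) {
            source = new GZIPInputStream(source);
        }
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in payload; real CAS heaps will compress differently.
        int[] heap = new int[1_000_000];
        for (int i = 0; i < heap.length; i++) {
            heap[i] = i % 1000;
        }
        for (boolean gzip : new boolean[] {false, true}) {
            long t0 = System.nanoTime();
            byte[] data = write(heap, gzip);
            long t1 = System.nanoTime();
            read(data, gzip);
            long t2 = System.nanoTime();
            System.out.printf("gzip=%b size=%d bytes write=%d ms read=%d ms%n",
                    gzip, data.length, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```

Running both variants on the same data shows the trade-off between output size and CPU time directly, which is the comparison that matters when deciding whether to keep gzip in the pipeline.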
