I had the thought that perhaps it would be quite easy to add Type System (only) serialization directly into Form 6 using all of its fancy compression techniques. Only those types used would need to be serialized. This would make it very convenient and efficient to store the type system (which would be the one used to filter the serialization, if present) with the serialized form.
I might take a swing at doing that...

-Marshall

On 8/11/2016 1:59 PM, Richard Eckart de Castilho wrote:
> On 11.08.2016, at 19:43, Marshall Schor <[email protected]> wrote:
>> I'm working on this now.
>>
>> I note that the new load(InputStream, CasMgrSerializer, CAS, boolean) method is
>> "public". Is there some code (perhaps in DKPro) that needs this form?
>>
>> If not, I'll remove this method and make the reading to create the
>> CasMgrSerializer "lazy" - not done until needed.
> Yep, I need something like that in DKPro.
>
> When the type system information is stored outside the binary CAS in a
> separate file, that TSI file would have to be re-read for every CAS file.
> Being able to pass the CasMgrSerializer to load() allows me to read it only
> once.
>
>> Not sure about zipping the type system - we have 3 choices, perhaps: 1) nothing,
>> 2) zip, 3) custom compression zip (like the rest of form 6).
>>
>> I'm leaning toward doing this work (if ever done) later.
> I've been pushing that ahead since implementing the BinaryCasReader/Writer :)
> Probably doesn't hurt if it gets pushed ahead a bit further.
> I had a quick look at the CasMgrSerializer - you called it highly inefficient.
> It doesn't look that inefficient. At least it uses primitive and String arrays
> and not collections :)
>
>> ================
>>
>> I have one more question - there's a comment which I don't see implemented -
>> which says that when a set of deserializations are being done with the same type
>> system, the extra work to handle the type system is only done once:
>>
>> * This method avoids the repeated loading of the typesystem and index definitions
>> * from a stream when loading many CASes in a row.
>>
>> How do you think that should be implemented?
> Well, that's happening when I read the CasMgrSerializer from a separate file - as
> explained above:
>
> casMgr = read(casMgrFile)
> for (file in directory) {
>   load(file, casMgr, CAS, boolean)
> }
>
> Cheers,
>
> -- Richard
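
For reference, the read-once pattern in Richard's pseudocode could be sketched in plain Java roughly as follows. This is only an illustration and deliberately avoids the real UIMA classes. `TypeSystemInfo`, `readTypeSystemInfo`, `loadCas`, and `loadAll` are hypothetical stand-ins for the CasMgrSerializer and the load(...) method discussed above, not actual API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ReuseTypeSystemSketch {

    /** Hypothetical stand-in for the parsed type system and index definitions. */
    static final class TypeSystemInfo {
        final String definitions;
        TypeSystemInfo(String definitions) { this.definitions = definitions; }
    }

    /** Counts how often the type system information was parsed. */
    static int tsiReads = 0;

    /** Placeholder for the expensive step: parsing the separate TSI stream. */
    static TypeSystemInfo readTypeSystemInfo(InputStream tsiStream) throws IOException {
        tsiReads++;
        return new TypeSystemInfo(readFully(tsiStream));
    }

    /** Placeholder for load(stream, casMgr, ...): combines CAS data with the shared TSI. */
    static String loadCas(InputStream casStream, TypeSystemInfo tsi) throws IOException {
        return readFully(casStream) + " [" + tsi.definitions + "]";
    }

    /** Parses the TSI once, then reuses the same object for every CAS stream. */
    static List<String> loadAll(InputStream tsiStream, List<InputStream> casStreams)
            throws IOException {
        TypeSystemInfo tsi = readTypeSystemInfo(tsiStream); // done once, outside the loop
        List<String> loaded = new ArrayList<>();
        for (InputStream casStream : casStreams) {
            loaded.add(loadCas(casStream, tsi));
        }
        return loaded;
    }

    /** Reads a stream to the end and decodes it as UTF-8. */
    private static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The point of the sketch is only that the type system parse sits outside the loop, so its cost is paid once per batch rather than once per CAS file.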
