The problem is that RecordFactory creates Biff records. But the MSODRAWINGGROUP record extends over several Biff records. It works kind of like CONTINUE records, except it would be 1 to many consecutive MSODRAWINGGROUP records.
The problem here is that the MSODRAWINGGROUP structure is not a Biff record. It is a large structure that is stored on one to many Biff records with a type of MSODRAWINGGROUP. Things get even more fun with MSODRAWING. In this case, the structure is also too large to fit in a single Biff record. But Excel intersperses the MSODRAWING biff records with some undocumented record - I think it's a description of the range name associated with a drawing shape. And Excel has more complicated rules about how the MSODRAWING record is broken up. Roughly, each drawing shape gets its own BIFF record - and the first MSODRAWING record has the container for the entire worksheet. RecordFactory (as it currently exists) returns only Biff record subclasses. And in the case of these drawing structures, the Biff record contains only a fragment of the structure. I have not yet been able to make it work with anything larger than 8K. I suppose, there might need to be some superclass of org.apache.poi.hssf.record.Record, which could contain things like MSODRAWINGGROUP that are actually stored as multiple Biff records. Then RecordFactory.createRecords( InputStream) would return a List of these superclasses. The alternative is to add a field last_drawing_record into RecordFactory, then add more branches to the if/then logic that handles Continue records. The latter is hard to do in a way that will not impact performance, because there is no separate record id to trigger the processing (as is the case for a CONTINUE record). And also, as I stated above, MSODRAWINGGROUP biff records are contiguous, but MSODRAWING biff records are not. I do not know that there could not be records in between MSODRAWING records that needed their own CONTINUE records. Michael Zalewski -----Original Message----- From: Avik Sengupta [mailto:[EMAIL PROTECTED] Sent: Saturday, October 25, 2003 2:49 PM To: POI Developers List Subject: Re: RE: reading images out of xls files (and transforming them?) > Anybody want to continue this discussion? Yes of course! The best thing would be to throw some code and have people play around a bit with it. In terms of how to fit it into poi ... Well, in general, to read, the first place would be to hook it into RecordFactory, and branch off from there. In terms of high level access, we try to keep it mapped to excel... is an image linked to a cell, or to a sheet?? we have similar links from the high level objects..... Writing may be more difficult, but in general, every record serializes itself, so i dont think it will be impossible to have a custom serialisation process..... just some thought, probably needs more playing around... Regards - Avik -------Original Message------- From: Michael Zalewski <[EMAIL PROTECTED]> Subject: RE: reading images out of xls files (and transforming them?) Sent: 25 Oct 2003 09:43:32 I have worked out the details of how this works. But I am not sure how to fit it into HSSF. Images in Excel spreadsheets can be of two types. The kind with 'Insert Picture', and the kind you get with 'Insert Object'. The 'Insert Picture' ones are the easiest to handle. They work like this: There is a record type called MSODRAWINGGROUP (type 0xebh, it's not implemented in HSSF). It contains a hierarchy of drawing records, used by the drawing layer. Inside this hierarchy, you will find something called a BLIP Entry record (it's not a BIFF record - it's a piece of the MSODRAWINGGROUP structure). IMDATA type 07fh is an obsolete record type (I think). You might be able to get it when you read an older version of Excel into Excel 97. When you use Excel 97 and above, Insert | Picture gives you the MSODRAWINGGROUP record I describe below. Drawing records come in two types - containers and atoms. Containers are like file directories - they contain one or more child drawing records, some of which can also be containers. Atoms are like files, or leafs of a tree. The layout of a drawing record is like this: There is an eight byte fixed length header, and a variable length detail part. The length of the variable length part is described in the header. The header has 4 fields The first two bytes contain 2 fields - the instance variable and the record version. But the instance variable is 12 bits (1 1/2 bytes), and the record version is the remaining 4 bytes. The interpretation of instance data depends on the record type. Often it is zero. The version is always the same for the same record type in Excel. I suspect other Office products may have different versions, or perhaps later versions of Excel might have higher versions of some record types. The version is always 0xfh for a container. Anything else tells you the record is an atom. Bytes 3 - 4 are the record ID, which tells you the type of record. Most of the records have a record ID like F003h. The record you are interested in will have a record ID of F007h (BLIP Entry - I will explain later) In the case of a BLIP container, the instance data is the number of drawings in the worksheet. Bytes 5 - 8 are the length of the variable part. Container records can have nested child records in the variable length part. Atoms might have a zero-length variable part. So if a container had two atoms, and each atom had no variable length data, there would be 3 drawing group records. The length field in the container would be 16 (because the two atoms are both 8 bytes long). And the length field in each atom would be 0. The total length of a drawing record would be the length field + 8. BLIP Entry Atoms are the parts that contain the binary image. The version in Excel is 6, and the instance data is a sequentially assigned number called the BLIP ID. (I didn't make these terms up). Other drawing records will refer to the image by the value of the BLIP ID. The image is usually in PNG format. You can tell the format by the first byte of the atom payload (the variable length part), which is 0 for a deleted entry, 5 for JPEG, 6 for PNG, and 7 for bitmaps. (But most images get converted to PNG, even if you insert them as JPG or BMP). If the first byte is 0x6h, you should see a marker byte with a value of 0xff at offset 60, and you should find the PNG image beginning at offset 61. (I'm skipping a lot of detail here. I'm not even sure of the correct interpretation of all the bytes in between.) I think you will always find something at offset 36 that looks like an atom, with a record type of f01e, and instance data 'n' (06eh). I think that means that the PNG is not compressed. Now for the fun part. The MSODRAWINGGROUP record I described above can be huge. In theory, over a gigbabyte. But Biff records can only be about 8K. So what if your image is too large to fit? You would expect that Excel would store the overflow in CONTINUE records, like it does for other records which get to be longer than the 8K limit. But apparently, that is not true for drawing records, which get stored as contiguous MSODRAWINGGROUP records. So HSSF wants to split this nested structure that I described above into discrete sequential units, and it usually splits right in the middle of an atom. That makes the implementation hard in HSSF (otherwise, I have most of a patch completed for inserting pictures. But the patch is *huge* -- 6 modified HSSF classes so far, 6 new HSSF classes, 45 new classes to represent various atoms, containers, and other structures, and some 30 odd test cases. I am stuck on how to interface this patch with HSSF. If anyone has any thoughts, I would love to hear. Frankly, I thought someone else would have provided a solution by now - I thought at least two people were working on it. Anybody want to continue this discussion? Michael Zalewski -----Original Message----- From: Steve Widmar [mailto:[EMAIL PROTECTED] Sent: Friday, October 24, 2003 2:09 AM To: [EMAIL PROTECTED] Subject: reading images out of xls files (and transforming them?) Hi - I have been asked to support a requirement to read images included inside a .xls binary and publish them online (converted to gif|png) (plus a whole bunch of other stuff I *can* figger out how to do myself with Poi, thanks to the Poi community...). It doesn't look like this functionality is supported in 2.0 pre3. I searched the javadocs (esp. HSSFCell and ..record.* - conspicuous gap betw. HideObjectRecord and IndexRecord where IMDATA support s/b?!), he other online dox and archives ... I found a lot of threads on writing images, with the general recommendation being using a template etc.etc. but nothing about reading and nothing indicating it is supported. I found this bit of inspiration from Andy http://www.mail-archive.com/[EMAIL PROTECTED]/msg02255.html, but nothing else and I suspect his long-term estimates are like mine (aggressive and assuming nothing else comes up). I haven't yet tucked into the code analysis - but I did see where the patch to read embedded images would be a beefy effort (http://www.mail-archive.com/[EMAIL PROTECTED]/msg02035.html). If anyone out there wants to give me a nudge in the right direction (like elaboratiing a bit on "at least 10 classes, includes 3 new substantial ones and possibly an entire new top level port"), I wouldn't mind at all to scratch this itch myself... (and I even gots a copy of MS Excel 97 Developer's Kit). Thanks - Steve --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -------Original Message------- --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
