RE: RE: reading images out of xls files (and transforming them?)

Michael Zalewski Sat, 25 Oct 2003 18:04:44 -0700

The problem is that RecordFactory creates Biff records. But the
MSODRAWINGGROUP record extends over several Biff records. It works kind of
like CONTINUE records, except it would be 1 to many consecutive
MSODRAWINGGROUP records.


The problem here is that the MSODRAWINGGROUP structure is not a Biff record.
It is a large structure that is stored on one to many Biff records with a
type of MSODRAWINGGROUP.

Things get even more fun with MSODRAWING. In this case, the structure is
also too large to fit in a single Biff record. But Excel intersperses the
MSODRAWING biff records with some undocumented record - I think it's a
description of the range name associated with a drawing shape. And Excel has
more complicated rules about how the MSODRAWING record is broken up.
Roughly, each drawing shape gets its own BIFF record - and the first
MSODRAWING record has the container for the entire worksheet.

RecordFactory (as it currently exists) returns only Biff record subclasses.
And in the case of these drawing structures, the Biff record contains only a
fragment of the structure. I have not yet been able to make it work with
anything larger than 8K.

I suppose, there might need to be some superclass of
org.apache.poi.hssf.record.Record, which could contain things like
MSODRAWINGGROUP that are actually stored as multiple Biff records. Then
RecordFactory.createRecords( InputStream) would return a List of these
superclasses.

The alternative is to add a field last_drawing_record into RecordFactory,
then add more branches to the if/then logic that handles Continue records.
The latter is hard to do in a way that will not impact performance, because
there is no separate record id to trigger the processing (as is the case for
a CONTINUE record). And also, as I stated above, MSODRAWINGGROUP biff
records are contiguous, but MSODRAWING biff records are not. I do not know
that there could not be records in between MSODRAWING records that needed
their own CONTINUE records.

Michael Zalewski

-----Original Message-----
From: Avik Sengupta [mailto:[EMAIL PROTECTED]
Sent: Saturday, October 25, 2003 2:49 PM
To: POI Developers List
Subject: Re: RE: reading images out of xls files (and transforming them?)

> Anybody want to continue this discussion?
Yes of course!

The best thing would be to throw some code and have people play around a bit
with it.

In terms of how to fit it into poi ...

Well, in general, to read, the first place would be to hook it into
RecordFactory, and branch off from there.

In terms of high level access, we try to keep it mapped to excel... is an
image linked to a cell, or to a sheet?? we have similar links from the high
level objects.....

Writing may be more difficult,  but in general, every record serializes
itself, so i dont think it will be impossible to have a custom serialisation
process.....

just some thought, probably needs more playing around...

Regards
-
Avik







   -------Original Message-------
  From: Michael Zalewski &lt;[EMAIL PROTECTED]&gt;
  Subject: RE: reading images out of xls files (and transforming them?)
  Sent: 25 Oct 2003 09:43:32

  I have worked out the details of how this works. But I am not sure how to
  fit it into HSSF.

  Images in Excel spreadsheets can be of two types. The kind with 'Insert
  Picture', and the kind you get with 'Insert Object'. The 'Insert Picture'
  ones are the easiest to handle. They work like this:

  There is a record type called MSODRAWINGGROUP (type 0xebh, it's not
  implemented in HSSF). It contains a hierarchy of drawing records, used by
  the drawing layer. Inside this hierarchy, you will find something called a
  BLIP Entry record (it's not a BIFF record - it's a piece of the
  MSODRAWINGGROUP structure).

  IMDATA type 07fh is an obsolete record type (I think). You might be able
to
  get it when you read an older version of Excel into Excel 97. When you use
  Excel 97 and above, Insert | Picture gives you the MSODRAWINGGROUP record
I
  describe below.

  Drawing records come in two types - containers and atoms. Containers are
  like file directories - they contain one or more child drawing records,
some
  of which can also be containers. Atoms are like files, or leafs of a tree.

  The layout of a drawing record is like this: There is an eight byte fixed
  length header, and a variable length detail part. The length of the
variable
  length part is described in the header. The header has 4 fields

  The first two bytes contain 2 fields - the instance variable and the
record
  version. But the instance variable is 12 bits (1 1/2 bytes), and the
record
  version is the remaining 4 bytes. The interpretation of instance data
  depends on the record type. Often it is zero. The version is always the
same
  for the same record type in Excel. I suspect other Office products may
have
  different versions, or perhaps later versions of Excel might have higher
  versions of some record types. The version is always 0xfh for a container.
  Anything else tells you the record is an atom.

  Bytes 3 - 4 are the record ID, which tells you the type of record. Most of
  the records have a record ID like F003h. The record you are interested in
  will have a record ID of F007h (BLIP Entry - I will explain later) In the
  case of a BLIP container, the instance data is the number of drawings in
the
  worksheet.

  Bytes 5 - 8 are the length of the variable part. Container records can
have
  nested child records in the variable length part. Atoms might have a
  zero-length variable part.

  So if a container had two atoms, and each atom had no variable length
data,
  there would be 3 drawing group records. The length field in the container
  would be 16 (because the two atoms are both 8 bytes long). And the length
  field in each atom would be 0. The total length of a drawing record would
be
  the length field + 8.

  BLIP Entry Atoms are the parts that contain the binary image. The version
in
  Excel is 6, and the instance data is a sequentially assigned number called
  the BLIP ID. (I didn't make these terms up). Other drawing records will
  refer to the image by the value of the BLIP ID.

  The image is usually in PNG format. You can tell the format by the first
  byte of the atom payload (the variable length part), which is 0 for a
  deleted entry, 5 for JPEG, 6 for PNG, and 7 for bitmaps. (But most images
  get converted to PNG, even if you insert them as JPG or BMP). If the first
  byte is 0x6h, you should see a marker byte with a value of 0xff at offset
  60, and you should find the PNG image beginning at offset 61. (I'm
skipping
  a lot of detail here. I'm not even sure of the correct interpretation of
all
  the bytes in between.) I think you will always find something at offset 36
  that looks like an atom, with a record type of f01e, and instance data 'n'
   (06eh). I think that means that the PNG is not compressed.

  Now for the fun part. The MSODRAWINGGROUP record I described above can be
  huge. In theory, over a gigbabyte. But Biff records can only be about 8K.
So
  what if your image is too large to fit?

  You would expect that Excel would store the overflow in CONTINUE records,
 like it does for other records which get to be longer than the 8K limit.
But
 apparently, that is not true for drawing records, which get stored as
 contiguous MSODRAWINGGROUP records.

 So HSSF wants to split this nested structure that I described above into
 discrete sequential units, and it usually splits right in the middle of an
 atom. That makes the implementation hard in HSSF (otherwise, I have most of
  a patch completed for inserting pictures. But the patch is *huge* -- 6
  modified HSSF classes so far, 6 new HSSF classes,  45 new classes to
  represent various atoms, containers, and other structures, and some 30 odd
  test cases.

  I am stuck on how to interface this patch with HSSF. If anyone has any
  thoughts, I would love to hear. Frankly, I thought someone else would have
  provided a solution by now - I thought at least two people were working on
  it. Anybody want to continue this discussion?

  Michael Zalewski

   -----Original Message-----
   From: Steve Widmar [mailto:[EMAIL PROTECTED]
   Sent: Friday, October 24, 2003 2:09 AM
   To: [EMAIL PROTECTED]
   Subject: reading images out of xls files (and transforming them?)

   Hi -

   I have been asked to support a requirement to read images included inside
a
  .xls binary and publish them online (converted to gif|png)
  (plus a whole bunch of other stuff I *can* figger out how to do myself
with
  Poi, thanks to the Poi community...).

  It doesn't look like this functionality is supported in  2.0 pre3.

  I searched the javadocs (esp. HSSFCell and ..record.* - conspicuous gap
  betw. HideObjectRecord and IndexRecord where IMDATA support s/b?!), he
  other online dox and archives ...

  I found a lot of threads on writing images, with the general
recommendation
  being using a template etc.etc. but nothing about reading and nothing
  indicating it is supported.

   I found this bit of inspiration from Andy
   http://www.mail-archive.com/[EMAIL PROTECTED]/msg02255.html,
but
  nothing else and I suspect his long-term estimates are like mine
(aggressive
  and assuming nothing else comes up).

  I haven't yet tucked into the code analysis - but I did see where the
patch
  to read embedded images would be a beefy effort
  (http://www.mail-archive.com/[EMAIL PROTECTED]/msg02035.html).

  If anyone out there wants to give me a nudge in the right direction
  (like elaboratiing a bit on &quot;at least 10 classes, includes 3 new
substantial
  ones and possibly an entire new top level port&quot;), I wouldn't mind at
all to
  scratch this itch myself... (and I even gots a copy of MS Excel 97
  Developer's Kit).

  Thanks -
  Steve


  ---------------------------------------------------------------------
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  -------Original Message-------



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: RE: reading images out of xls files (and transforming them?)

Reply via email to