On the metadata issue, those rows refer to the embedded bmps in that one ppt.

TODO: I should add the embedded file name to the metadata value count
details file so that you can see which embedded file has differing counts.

There are, in fact, fewer metadata items. This is likely caused by a change
in Tika and our dependencies.

This is what we get in 1.23:

    "Chroma ColorSpaceType": "RGB",
    "Chroma NumChannels": "3",
    "Compression CompressionTypeName": "BI_RGB",
    "Compression Lossless": "true",
    "Content-Type": "image/bmp",
    "Data BitsPerSample": "8 8 8",
    "Data SampleFormat": "UnsignedIntegral",
    "Dimension HorizontalPhysicalPixelSpacing": "0.26462027",
    "Dimension HorizontalPixelSize": "0.26462027",
    "Dimension PixelAspectRatio": "1.0",
    "Dimension VerticalPhysicalPixelSpacing": "0.26462027",
    "Dimension VerticalPixelSize": "0.26462027",
    "Document FormatVersion": "BMP v. 3.x",
    "Transparency Alpha": "none",
    "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.image.ImageParser"
    ],
    "X-TIKA:digest:MD5": "d29cb08f19bfd203d5517cdac8f36dd4",
    "X-TIKA:embedded_depth": "1",
    "X-TIKA:embedded_resource_path": "/embedded-4",
    "X-TIKA:parse_time_millis": "2",
    "height": "1",
    "tiff:BitsPerSample": "8 8 8",
    "tiff:ImageLength": "1",
    "tiff:ImageWidth": "2",
    "width": "2"

This is what we get with 1.24-SNAPSHOT and POI 4.1.2:

    "Compression CompressionTypeName": "BI_RGB",
    "Content-Type": "image/bmp",
    "Data BitsPerSample": "8 8 8",
    "Dimension HorizontalPhysicalPixelSpacing": "0.26462027",
    "Dimension PixelAspectRatio": "1.0",
    "Dimension VerticalPhysicalPixelSpacing": "0.26462027",
    "X-Parsed-By": [
        "org.apache.tika.parser.CompositeParser",
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.image.ImageParser"
    ],
    "X-TIKA:digest:MD5": "d29cb08f19bfd203d5517cdac8f36dd4",
    "X-TIKA:embedded_depth": "1",
    "X-TIKA:embedded_resource_path": "/embedded-4",
    "X-TIKA:parse_time_millis": "3",
    "height": "1",
    "tiff:BitsPerSample": "8 8 8",
    "tiff:ImageLength": "1",
    "tiff:ImageWidth": "2",
    "width": "2"

On Fri, Feb 7, 2020 at 1:31 PM Tim Allison <talli...@apache.org> wrote:

> a) In the SQLs, I see the *_a/*_b tables - so _a is the result of using
> POI 4.1.1 and _b of POI 4.1.2?
> a is tika 1.23 (which used 4.1.1), b is tika 1.x branch with 4.1.2 --
> *WARNING -- diffs we observe may be changes in Tika btwn 1.23 and 1.x
> branch.
>
> b) Are the stats evaluated for both each time or is *_a cached from last
> run?
> I had to rerun 1.23 because I had wiped it out.
>
> b) If a) is true, it's interesting that the attachment-missing* have such
> similar numbers. I would expect one side to outweigh the other.
> That is unexpected. Aligning attachments is tricky if one version is
> missing a version. It is possible that this reflects failure to align.
> I'll look into this.
>
> c) I've checked one of metadata diffs (govdocs1/338/338907.ppt) and can't
> reproduce/don't understand the values in the report
> I've put the .json output here:
> http://162.242.228.174/share/338907_ppt.tgz. I haven't looked yet, but
> will.
>
> d) looking at the parse times: there are quite a few .ppt which only take
> 100-400ms in _a whereas in _b it takes them 3-5 sec.
>
> That _may_ be caused by diffs in loads on the m|vm...other stuff going on
> in the jvm. Parse times per file can vary wildly
> even with the same versions on different runs. The key for me is the
> rollup by parse time suggests _overall_ for ppt,
> the time is nearly identical.
>
>
>> On 07.02.20 13:05, Tim Allison wrote:
>> > Hi All,
>> > I haven't had the chance to look, but will do so later today:
>> > http://162.242.228.174/reports/poi_4.1.2_reports.tgz
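
For anyone who wants to check which metadata keys dropped out between two
versions' JSON output for the same embedded file, here is a minimal sketch in
Python. The function name diff_metadata and the sample dicts are hypothetical
and only for illustration (this is not part of tika-eval); the values are a
trimmed subset of the two BMP records in this message, and the real records
have more keys.

```python
def diff_metadata(a, b):
    """Compare two metadata dicts.

    Returns three sorted lists: keys only in a, keys only in b,
    and keys present in both whose values differ.
    """
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


# Trimmed sample: meta_a is from the 1.23 record, meta_b from 1.24-SNAPSHOT.
meta_a = {
    "Chroma ColorSpaceType": "RGB",
    "Compression Lossless": "true",
    "Content-Type": "image/bmp",
    "X-TIKA:embedded_resource_path": "/embedded-4",
    "X-TIKA:parse_time_millis": "2",
    "width": "2",
}
meta_b = {
    "Content-Type": "image/bmp",
    "X-TIKA:embedded_resource_path": "/embedded-4",
    "X-TIKA:parse_time_millis": "3",
    "width": "2",
}

only_a, only_b, changed = diff_metadata(meta_a, meta_b)
print("only in a:", only_a)
print("only in b:", only_b)
print("changed:  ", changed)
```

Matching records on X-TIKA:embedded_resource_path before diffing would be one
way to make sure the same embedded file is being compared on both sides;
parse_time_millis will show up as changed on almost every run and can be
ignored.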