[gears-eng] a code review: 12193738 Make desktop.extractMetaData now recognize MP3, PDF and

Nigel Tao Tue, 04 Aug 2009 17:24:24 -0700

This is a re-mail, this time with the patch attached.



Hello noel,

I'd like you to do a code review.  Please execute
        g4 diff -c 12193738

or point your web browser to
        http://mondrian/12193738

to review the following code:

Change 12193738 by nigel...@nigeltao-srcgears1 on 2009/08/04 20:52:51 *pending*

        Make desktop.extractMetaData now recognize MP3, PDF and ZIP files.
        
        PRESUBMIT=passed
        R=noel
        [email protected]
        DELTA=135  (108 added, 26 deleted, 1 changed)
        OCL=12193738

Affected files ...

... //depot/googleclient/gears/opensource/gears/desktop/meta_data_extraction.cc#6 edit

135 delta lines: 108 added, 26 deleted, 1 changed

Also consider running:
        g4 lint -c 12193738

which verifies that the changelist doesn't introduce new style violations.

If you can't do the review, please let me know as soon as possible.  During
your review, please ensure that all new code has corresponding unit tests and
that existing unit tests are updated appropriately.  Visit
http://www/eng/code_review.html for more information.

This is a semiautomated message from "g4 mail".  Complaints or suggestions?
Mail [email protected].

==== //depot/googleclient/gears/opensource/gears/desktop/meta_data_extraction.cc#6 - /home/nigeltao/srcgears1/googleclient/gears/opensource/gears/desktop/meta_data_extraction.cc ====
# action=edit type=text
--- googleclient/gears/opensource/gears/desktop/meta_data_extraction.cc 2009-08-04 20:53:23.000000000 +1000
+++ googleclient/gears/opensource/gears/desktop/meta_data_extraction.cc 2009-08-04 20:50:08.000000000 +1000
@@ -119,20 +119,6 @@
 
 
 static bool ExtractMetaDataJpeg(BlobInterface *blob, JsObject *result) {
-  static const int kJpegMagicHeaderLength = 2;
-  if (blob->Length() < kJpegMagicHeaderLength) {
-    return false;
-  }
-  uint8 magic_header[kJpegMagicHeaderLength];
-  if (kJpegMagicHeaderLength !=
-          blob->Read(magic_header, 0, kJpegMagicHeaderLength)) {
-    return false;
-  }
-  // The JPEG Start-Of-Image (SOI) marker is 0xFFD8.
-  if (magic_header[0] != 0xFF || magic_header[1] != 0xD8) {
-    return false;
-  }
-
   JpegBlobReadContext context;
   context.blob = blob;
   context.offset = 0;
@@ -206,17 +192,6 @@
 
 static bool ExtractMetaDataPng(BlobInterface *blob, JsObject *result) {
   static const int kPngMagicHeaderLength = 8;
-  if (blob->Length() < kPngMagicHeaderLength) {
-    return false;
-  }
-  uint8 magic_header[kPngMagicHeaderLength];
-  if (kPngMagicHeaderLength !=
-          blob->Read(magic_header, 0, kPngMagicHeaderLength)) {
-    return false;
-  }
-  if (png_sig_cmp(magic_header, 0, kPngMagicHeaderLength) != 0) {
-    return false;
-  }
 
   png_structp png_ptr = png_create_read_struct(
       PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
@@ -235,6 +210,7 @@
 
   PngBlobReadContext context;
   context.blob = blob;
+  // We've already read the magic header, in ExtractViaMagicHeader.
   context.offset = kPngMagicHeaderLength;
   png_set_sig_bytes(png_ptr, kPngMagicHeaderLength);
   png_set_read_fn(png_ptr, &context, PngBlobReadFunction);
@@ -262,12 +238,118 @@
 
 
 //-----------------------------------------------------------------------------
+// Microsoft Office meta-data extraction functions
+
+
+static bool ExtractMetaDataMsOffice(BlobInterface *blob, JsObject *result) {
+  // The Microsoft Office binary file formats are documented at
+  // http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
+  //
+  // A heuristic for distinguishing Word, Excel and PowerPoint files (by the
+  // files' contents, not by the files' names and extensions) is discussed at
+  // http://social.msdn.microsoft.com/Forums/en-US/os_odf/thread/
+  //     343d09e3-5fdf-4b4a-9fa6-8ccb37a35930
+  // Of course, another obvious heuristic at the JavaScript level is simply
+  // comparing the extension of the original file to ".doc", ".xls" or ".ppt"
+  // (provided, of course, that the file extensions are available, which might
+  // not be true for arbitrary Blobs).
+  //
+  // An exact algorithm for determining the file's metadata would come from
+  // parsing the Summary Information Stream, which would yield those properties
+  // listed at http://msdn.microsoft.com/en-us/library/aa380376(VS.85).aspx and
+  // http://poi.apache.org/apidocs/org/apache/poi/hpsf/SummaryInformation.html
+  //
+  // However, parsing the Windows Compound Binary File Format is not trivial,
+  // and finding the Summary Information Stream does not necessarily involve
+  // simply jumping to a fixed offset.
+  //
+  // Thus, for now, we simply return false (and hence ultimately return
+  // "application/octet-stream" to JavaScript). A future code change might
+  // implement walking the file format and cracking open the information
+  // within, but for now, this is unimplemented.
+  return false;
+}
+
+
+//-----------------------------------------------------------------------------
 // Public API
 
 
+static bool ExtractViaMagicHeader(BlobInterface *blob, JsObject *result) {
+  static const int kMagicHeaderLength = 8;
+  if (blob->Length() < kMagicHeaderLength) {
+    return false;
+  }
+  uint8 magic_header[kMagicHeaderLength];
+  if (kMagicHeaderLength != blob->Read(magic_header, 0, kMagicHeaderLength)) {
+    return false;
+  }
+
+  // The JPEG Start-Of-Image (SOI) marker is 0xFFD8.
+  if (magic_header[0] == 0xFF && magic_header[1] == 0xD8) {
+    return ExtractMetaDataJpeg(blob, result);
+  }
+
+  // PNG files start with the 8-byte header as per
+  // http://www.w3.org/TR/2003/REC-PNG-20031110/
+  if (magic_header[0] == 0x89 && magic_header[1] == 0x50 &&
+      magic_header[2] == 0x4E && magic_header[3] == 0x47 &&
+      magic_header[4] == 0x0D && magic_header[5] == 0x0A &&
+      magic_header[6] == 0x1A && magic_header[7] == 0x0A) {
+    return ExtractMetaDataPng(blob, result);
+  }
+
+  // The Windows Compound Binary File Format Specification at
+  // http://download.microsoft.com/download/0/B/E/
+  //     0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/
+  //     WindowsCompoundBinaryFileFormatSpecification.pdf
+  // gives the 8-byte magic header for Microsoft Office documents, which
+  // include Word (.doc), Excel (.xls), and PowerPoint (.ppt).
+  if (magic_header[0] == 0xD0 && magic_header[1] == 0xCF &&
+      magic_header[2] == 0x11 && magic_header[3] == 0xE0 &&
+      magic_header[4] == 0xA1 && magic_header[5] == 0xB1 &&
+      magic_header[6] == 0x1A && magic_header[7] == 0xE1) {
+    return ExtractMetaDataMsOffice(blob, result);
+  }
+
+  // PDF files start with "%PDF-".
+  if (magic_header[0] == 0x25 && magic_header[1] == 0x50 &&
+      magic_header[2] == 0x44 && magic_header[3] == 0x46 &&
+      magic_header[4] == 0x2D) {
+    result->SetPropertyString(
+        std::string16(STRING16(L"mimeType")),
+        std::string16(STRING16(L"application/pdf")));
+    return true;
+  }
+
+  // Zip files start with "PK\x03\x04".
+  if (magic_header[0] == 0x50 && magic_header[1] == 0x4B &&
+      magic_header[2] == 0x03 && magic_header[3] == 0x04) {
+    result->SetPropertyString(
+        std::string16(STRING16(L"mimeType")),
+        std::string16(STRING16(L"application/zip")));
+    return true;
+  }
+
+  // The MPEG audio file format is documented at
+  // http://www.mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm
+  // In particular, of the 4-byte frame header, the first 11 bits (in position
+  // 31-21) must be on, and bits at position (18-17) must be 01 (for MPEG layer
+  // III, which is where the name "mp3" comes from).
+  // As per the document referenced above, we want bits A and C from:
+  // AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM.
+  if (magic_header[0] == 0xFF && (magic_header[1] & 0xE6) == 0xE2) {
+    result->SetPropertyString(
+        std::string16(STRING16(L"mimeType")),
+        std::string16(STRING16(L"audio/mpeg")));
+    return true;
+  }
+
+  return false;
+}
+
 void ExtractMetaData(BlobInterface *blob, JsObject *result) {
-  if (ExtractMetaDataJpeg(blob, result)) return;
-  if (ExtractMetaDataPng(blob, result)) return;
+  if (ExtractViaMagicHeader(blob, result)) return;
 
   // If we didn't match an explicit mime-type, above, then we fall back on
   // the default "application/octet-stream".
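
For reviewers who want to check the byte comparisons outside the Gears tree: the
dispatch in ExtractViaMagicHeader amounts to a prefix table plus one bit-mask
test for MPEG audio. Below is a minimal standalone sketch of that idea. It is
not the patch's code. MagicEntry, kMagicTable, IsMp3FrameHeader and
SniffMimeType are hypothetical names, the function takes a raw byte buffer
instead of a BlobInterface, and the MIME strings are this sketch's own choices
(audio/mpeg is the type registered for MP3 in RFC 3003). Unlike the patch, it
checks the length per entry rather than requiring at least 8 bytes up front,
and it leaves out the Microsoft Office header because that path yields no MIME
type yet.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical table entry: a byte prefix at offset 0 and the MIME type it
// implies. None of these names exist in the Gears source.
struct MagicEntry {
  const uint8_t *prefix;
  size_t prefix_length;
  const char *mime_type;
};

static const uint8_t kJpegMagic[] = {0xFF, 0xD8};  // Start-Of-Image marker.
static const uint8_t kPngMagic[] = {0x89, 0x50, 0x4E, 0x47,
                                    0x0D, 0x0A, 0x1A, 0x0A};
static const uint8_t kPdfMagic[] = {0x25, 0x50, 0x44, 0x46, 0x2D};  // "%PDF-"
static const uint8_t kZipMagic[] = {0x50, 0x4B, 0x03, 0x04};  // "PK\x03\x04"

static const MagicEntry kMagicTable[] = {
  {kJpegMagic, sizeof(kJpegMagic), "image/jpeg"},
  {kPngMagic, sizeof(kPngMagic), "image/png"},
  {kPdfMagic, sizeof(kPdfMagic), "application/pdf"},
  {kZipMagic, sizeof(kZipMagic), "application/zip"},
};

// The first two bytes of an MPEG audio frame header are AAAAAAAA AAABBCCD.
// The mask 0xE6 keeps the three A bits and the two C bits of the second byte.
// All A bits set and C == 01 (Layer III) gives the value 0xE2.
bool IsMp3FrameHeader(uint8_t byte0, uint8_t byte1) {
  return byte0 == 0xFF && (byte1 & 0xE6) == 0xE2;
}

// Returns a MIME type guessed from the leading bytes, falling back on
// "application/octet-stream" when nothing matches.
std::string SniffMimeType(const uint8_t *data, size_t length) {
  for (const MagicEntry &entry : kMagicTable) {
    if (length >= entry.prefix_length &&
        std::memcmp(data, entry.prefix, entry.prefix_length) == 0) {
      return entry.mime_type;
    }
  }
  if (length >= 2 && IsMp3FrameHeader(data[0], data[1])) {
    return "audio/mpeg";
  }
  return "application/octet-stream";
}
```

The JPEG and MP3 tests cannot both match the same input: a second byte of 0xD8
masked with 0xE6 is 0xC0, not 0xE2, so the order of those two checks does not
affect the result.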
