On 04/20/2017 06:15 PM, Dossy Shiobara wrote:
On 4/20/17 4:43 PM, Matt Morgan wrote:
I guess what I'm asking is: is there an easy path from mail-archive.com
search results into a spreadsheet (I guess MySQL or Postgres would be
OK too) or some other kind of analysis tool?
I thought about doing this in Node.js but that would require a bit more
machinery that isn't available "out of the box" and I didn't want you to
get hung up on any dependencies, so, here it is in PHP which should work
with just out-of-the-box PHP (on most platforms, anyway):

$ php -r '
     $dom = new DOMDocument;
     $dom->loadHTML(file_get_contents("https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&f=1"));
     $doc = simplexml_import_dom($dom);
     $out = fopen("php://output", "w");
     fputcsv($out, array("link", "subject", "date", "name", "message"));
     $msg = array();
     foreach ($doc->body->div[0]->children() as $node) {
         switch ($node->getName()) {
             case "h3":
                 $msg["subj"] = (string) $node->span->a;
                 $msg["link"] = "https://www.mail-archive.com" . (string) $node->span->a["href"];
                 break;
             case "div":
                 $msg["date"] = (string) $node->span[0]->span->a;
                 $msg["name"] = (string) $node->span[2]->a;
                 break;
             case "blockquote":
                 $msg["body"] = (string) $node->span->pre;
                 break;
             case "br":
                 fputcsv($out, array($msg["link"], $msg["subj"],
                     $msg["date"], $msg["name"], $msg["body"]));
                 $msg = array();
                 break;
             default: break;
         }
     }' | tee msgs.csv
Closing the loop on this: Dossy's script works perfectly. In my case I had to install php-xml, but that was it (and if you use PHP at all you probably already have that).

I think I may have found a bug in the search engine. Compare these two queries:

1. search for

(+job OR +position)

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29&start=0

(1167 results, all of which have at least one occurrence of "job" or "position").

2. search for

job OR position

https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position&x=13&y=18

(10185 results, many of which have no occurrences of either "job" or "position").
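For anyone who wants to reproduce the two queries, here is a minimal sketch of how the query text maps to the encoded `q=` parameter in the URLs above. The function name `searchUrl` is made up for illustration, and only the `l` and `q` parameters are built; the extra `start`, `x`, and `y` parameters from the URLs above are left out on the assumption that they do not change which messages match.

```php
<?php
// Hypothetical helper (not something mail-archive.com provides):
// build a search URL for a given list address and query string.
function searchUrl(string $list, string $query): string
{
    // http_build_query() percent-encodes reserved characters such as
    // "(", ")", "+" and "@", and encodes spaces as "+" by default.
    $params = http_build_query(array("l" => $list, "q" => $query), "", "&");
    return "https://www.mail-archive.com/search?" . $params;
}

echo searchUrl("mcn-l@mcn.edu", "(+job OR +position)"), "\n";
// https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=%28%2Bjob+OR+%2Bposition%29

echo searchUrl("mcn-l@mcn.edu", "job OR position"), "\n";
// https://www.mail-archive.com/search?l=mcn-l%40mcn.edu&q=job+OR+position
```

Note that a literal plus sign in the query has to be sent as `%2B`, because a bare `+` in a query string is read as a space.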

I had expected query #2 to do what I wanted, but I had to fuss with the parens and plus signs to get it to actually limit the results to items with one of the words. Even

+job OR +position

without the parens got me results with neither word.
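Until the search behavior is sorted out, one workaround is to filter the CSV locally after running the script. The following is only a sketch under a few assumptions: the column order is the one written by the script above (link, subject, date, name, message), the function names `mentionsAny` and `filterCsv` are made up for illustration, and matching is a plain case-insensitive substring test, so "job" also matches "jobs". It may not agree exactly with whatever matching the search engine does.

```php
<?php
// Hypothetical helper: true if $text contains any of $words
// (case-insensitive substring match).
function mentionsAny(string $text, array $words): bool
{
    foreach ($words as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Copy the header row plus every row whose subject or message mentions
// one of $words from the stream $in to the stream $out.
// Returns the number of data rows kept.
function filterCsv($in, $out, array $words): int
{
    // Delimiter, enclosure and escape are spelled out with their long-standing
    // default values to match the fputcsv() call in the script above.
    $header = fgetcsv($in, 0, ",", "\"", "\\");
    if ($header === false) {
        return 0; // empty input
    }
    fputcsv($out, $header, ",", "\"", "\\");

    $kept = 0;
    while (($row = fgetcsv($in, 0, ",", "\"", "\\")) !== false) {
        if ($row === array(null)) {
            continue; // skip blank lines
        }
        // Assumed column layout: 1 = subject, 4 = message.
        $text = ($row[1] ?? "") . " " . ($row[4] ?? "");
        if (mentionsAny($text, $words)) {
            fputcsv($out, $row, ",", "\"", "\\");
            $kept++;
        }
    }
    return $kept;
}
```

It could be called as `filterCsv(fopen("msgs.csv", "r"), fopen("jobs.csv", "w"), array("job", "position"));`, where `jobs.csv` is just an example output name.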

Thanks,
Matt

_______________________________________________
Gossip mailing list
https://www.mail-archive.com/gossip@mail-archive.com
https://www.mail-archive.com/cgi-bin/mailman/options/gossip
