Bug#483247: Updating grepmail

2015-03-23 Thread David Coppit
For what it's worth, I'm in the process of updating grepmail.

I don't have ready access to the full CPAN-Testers test matrix, so I can't
guarantee that all tests will pass everywhere. But the obvious
failures will be fixed, along with a few other bugs, and support for lzip
and xz will be added.


Bug#432083: Fixed in grepmail versions 5.3034

2009-08-23 Thread David Coppit

Thanks for the bug report. I've fixed it and will push out a new release
later today.



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#234795: Need more information

2009-08-23 Thread David Coppit

Hi there,

I need more information to debug this. Please either confirm the bug and
provide more information, or mark this bug as not a bug.

grepmail uses Mail::Mbox::MessageParser, which is designed to use memory
proportional to the largest email message in a mailbox. I verified that it
does indeed operate this way, using a 54MB mailbox:

  mbox size:       56683943
  max email size:  11182857
  max read buffer: 11184795 -- Biggest size of M::M::MP's read buffer
  folder_reader:   11186558 -- Biggest size of the M::M::MP Perl object

Some stats from ps(1):

  Plain text mailbox:

  min real memory:  4976640
  min virtual memory: 618000384
  max real memory: 38674432
  max virtual memory: 651546624

  Gzip compressed:

  min real memory:  5005312
  min virtual memory: 618016768
  max real memory: 38694912
  max virtual memory: 651563008

I also tried a 540MB mailbox, created by concatenating the mailbox 10
times:

  Plain text x10:

  min real memory:  4976640
  min virtual memory: 618000384
  max real memory: 40292352
  max virtual memory: 652021760

  Gzip compressed x10:

  min real memory:  5005312
  min virtual memory: 618016768
  max real memory: 40284160
  max virtual memory: 652038144

The numbers above were basically the same for a 23KB mailbox. Also note
that this command:

  perl -e 'system "ps -o rss,vsz $$"'

consumes 1175552 real and 615645184 virtual memory, so the numbers above
are not out of the ordinary.

If you could run the attached anonymize_mailbox script on your mailbox,
verify that memory usage is still bad, then send the mailbox to me, I can
debug this better.

Another idea: perhaps your mailbox is malformed, such that grepmail only
sees 1 email in the whole mailbox. You can check this by running:

  grepmail -r . my_big_mailbox

If you want to confirm that you have a very large email in your mailbox,
find this line in grepmail:

  my $email = $folder_reader->read_next_email();

and follow it with this line:

  print length($$email) . "\n";

then run something like:

  grepmail nonexistent_pattern my_big_mailbox | sort -n
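
For a rough cross-check that does not require editing grepmail, the sketch
below measures message sizes directly. It is not how grepmail or
Mail::Mbox::MessageParser split mailboxes: it assumes every line starting
with "From " begins a new message, so quoted or attached messages will be
miscounted. The name message_sizes is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the byte length of each message in an mbox file, splitting naively
# on lines that start with "From ". Approximate by design (see note above).
sub message_sizes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";

    my @sizes;
    my $current      = 0;
    my $seen_message = 0;
    while (my $line = <$fh>) {
        if ($line =~ /^From /) {
            # Close off the previous message, if there was one
            push @sizes, $current if $seen_message;
            $current      = 0;
            $seen_message = 1;
        }
        $current += length $line;
    }
    push @sizes, $current if $seen_message;
    close $fh;
    return @sizes;
}

# When given a mailbox on the command line, print sizes with the largest last.
if (@ARGV) {
    print "$_\n" for sort { $a <=> $b } message_sizes($ARGV[0]);
}
```

If the last number printed is close to the size of the whole mailbox, the
mailbox probably contains one very large (or badly delimited) message.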

Regards,
David

_
David Coppit   http://coppit.org/






Bug#254045: -d bug: not a bug?

2009-08-23 Thread David Coppit

I believe this is not a bug. I suspect you entered a Unicode character
that looks like "-" but is not. Getopt::Std fails to get options unless
the option dash is exactly "-". Here's a program that you can use to test it.

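  use Getopt::Std;
  use Data::Dumper;

  # Show the first character of the first argument and its code
  ($c) = $ARGV[0] =~ /^(.)/;
  print "Character $c is ord(" . ord($c) . ")\n";

  # Parse -d (which takes a value) into %new_opts
  getopt('d',\%new_opts);
  print Dumper \%new_opts;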

When I run the program, I get:

  $ perl a -d 'before 6/1/04'
  Character - is ord(45)
  $VAR1 = {
            'd' => 'before 6/1/04'
          };

But when I copy and paste - from the website for your bug report I get:

  $ perl a ???d 'before 6/1/04'
  Character ? is ord(226)
  $VAR1 = {};

Please confirm and either provide more information or close the bug as
not a bug.
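
A quick way to check an argument for this problem is sketched below. It is
only an illustration: classify_leading_dash is a made-up helper, and it
assumes the argument bytes are UTF-8 encoded. It reports whether the first
character is the ASCII hyphen-minus (code 45) or a Unicode look-alike.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Classify the first character of a command-line argument.
# Returns 'ascii-dash', 'unicode-dash', or 'other'.
sub classify_leading_dash {
    my ($arg) = @_;

    # Arguments arrive as bytes; decode so a multi-byte dash is one character.
    my $text = decode('UTF-8', $arg);
    return 'other' unless length $text;

    my $first = substr($text, 0, 1);
    return 'ascii-dash' if $first eq '-';

    # \p{Pd} is the Unicode Dash_Punctuation category. U+2212 MINUS SIGN is
    # a math symbol rather than punctuation, so it is listed separately.
    return 'unicode-dash' if $first =~ /^[\p{Pd}\x{2212}]$/;
    return 'other';
}

for my $arg (@ARGV) {
    printf "%s: %s\n", $arg, classify_leading_dash($arg);
}
```

An en dash encoded as UTF-8 starts with byte 226, which matches the ord()
value in the failing run above.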

Thanks,
David

_
David Coppit   http://coppit.org/






Bug#234795: Need more information

2009-08-23 Thread David Coppit

Forgot the anonymize_mailbox script.


_
David Coppit   http://coppit.org/

#!/usr/bin/perl -w

$VERSION = '1.00';

use strict;
use FileHandle;

#---

my $LINE = 0;
my $FILE_HANDLE = undef;
my $START = 0;
my $END = 0;
my $READ_BUFFER = '';

sub reset_file
{
  my $file_handle = shift;

  $FILE_HANDLE = $file_handle;
  $LINE = 1;
  $START = 0;
  $END = 0;
  $READ_BUFFER = '';
}

#---

# Need this for a lookahead.
my $READ_CHUNK_SIZE = 0;

sub read_email
{
  # Undefined read buffer means we hit eof on the last read.
  return 0 unless defined $READ_BUFFER;

  my $line = $LINE;

  $START = $END;

  # Look for the start of the next email
  LOOK_FOR_NEXT_HEADER:
  while($READ_BUFFER =~ m/^(From\s.*\d:\d+:\d.* \d{4})/mg)
  {
    $END = pos($READ_BUFFER) - length($1);

    # Don't stop on email header for the first email in the buffer
    next if $END == 0;

    # Keep looking if the header we found is part of a Begin Included
    # Message.
    my $end_of_string = substr($READ_BUFFER, $END-200, 200);
    if ($end_of_string =~
        /\n-( Begin Included Message |Original Message)-\n[^\n]*\n*$/i)
    {
      next;
    }

    # Found the next email!
    my $email = substr($READ_BUFFER, $START, $END-$START);
    $LINE += ($email =~ tr/\n//);

    return (1, $email, $line);
  }

  # Didn't find next email in current buffer. Most likely we need to read some
  # more of the mailbox. Shift the current email to the front of the buffer
  # unless we've already done so.
  $READ_BUFFER = substr($READ_BUFFER,$START) unless $START == 0;
  $START = 0;

  # Start looking at the end of the buffer, but back up some in case the edge
  # of the newly read buffer contains the start of a new header. I believe the
  # RFC says header lines can be at most 90 characters long.
  my $search_position = length($READ_BUFFER) - 90;
  $search_position = 0 if $search_position < 0;

  # Can't use sysread because it doesn't work with ungetc
  if ($READ_CHUNK_SIZE == 0)
  {
    local $/ = undef;

    if (eof $FILE_HANDLE)
    {
      my $email = $READ_BUFFER;
      undef $READ_BUFFER;
      return (1, $email, $line);
    }
    else
    {
      # With $/ undefined, this reads the rest of the file in one go
      $READ_BUFFER = <$FILE_HANDLE>;
      pos($READ_BUFFER) = $search_position;
      goto LOOK_FOR_NEXT_HEADER;
    }
  }
  else
  {
    if (read($FILE_HANDLE, $READ_BUFFER, $READ_CHUNK_SIZE,
        length($READ_BUFFER)))
    {
      pos($READ_BUFFER) = $search_position;
      goto LOOK_FOR_NEXT_HEADER

Bug#395268: 1.5000

2007-04-23 Thread David Coppit

On Sun, 14 Jan 2007, Joey Hess wrote:


> I tried out 1.5000. I'm still seeing apparently the same hang with it
> while building Mail::MboxParser..


Hi Tassilo,

It looks like changes to my module Mail::Mbox::MessageParser are causing
your module Mail::MboxParser to hang during make test. I debugged the
issue, and it appears to be a problem with the way that you change the
file position while testing the newline type in the file, as well as
another unnecessary seek in next_message_new().

Attached is a patch that fixes the problem(s).

Regards,
David

P.S. Joey, there are still two warnings issued during the make test step
of Tassilo's module. These are coming from my module, and will be fixed in
versions > 1.5000.

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

When the president does it that means that it is not illegal.
- Richard Nixon on domestic surveillance, 5/19/1977
Do I have the legal authority to do this? And the answer is, absolutely.
- George W. Bush on domestic surveillance, 12/19/2005

--- MboxParser.pm   2005-12-08 05:15:39.0 -0500
+++ /Users/coppit/Desktop/MboxParser.pm 2007-04-23 13:54:45.0 -0400
@@ -519,7 +519,6 @@
 
 return undef if ref(\$p) eq 'SCALAR' or $p->end_of_file;
 
-seek $self->{READER}, $self->{CURR_POS}, SEEK_SET;
 my $nl = $self->{NL};
 my $mailref = $p->read_next_email;
 my ($header, $body) = split /$nl$nl/, $$mailref, 2;
@@ -794,6 +793,7 @@
 my $h = $self->{READER};
 my $newline;
 
+   my $old_position = tell $h;
 seek $h, 0, SEEK_SET;
 while (sysread $h, (my $c), 1) {
 if (ord($c) == 13) {
@@ -807,6 +807,7 @@
 last;
 }
 }
+   seek($h, $old_position, 0);
 return $newline;
 }
 


Bug#395268: hang in Mail::Mbox::MessageParser::Grep

2007-01-10 Thread David Coppit

On Tue, 9 Jan 2007, Joey Hess wrote:


> David, it seems that there's a bug in the Grep implementation of the
> MessageParser that can lead to a hang. See discussion at
> http://bugs.debian.org/395268


Thanks for the heads up.


> I share Steinar's confusion about what $self->{'CURRENT_EMAIL_INDEX'}
> should be used for and how it relates to $self->{'email_number'}


I renamed CURRENT_EMAIL_INDEX to CHUNK_INDEX, and I've added the following
documentation:

  # Reading grep data provides us with an array of potential email
  # starting locations. However, due to included emails and attachments,
  # we have to validate these locations as actually being the start of
  # emails. As a result, there may be more chunks in the array than
  # emails. So CHUNK_INDEX >= email_number-1.

I've found that when the grep implementation goes into an infinite loop,
it's because the grep data does not match the file, as would be the case
if the file was modified after grep was run. My next release will detect
this case and try to recover.


> As a temporary workaround, I've disabled the Grep implementation in the
> Debian package.


I'll ping you when the release comes out so that you can test it. (I'm not
sure how to recreate the bug myself.)

Comments on one of the emails are below.

BTW, I see from the link you provided that this is marked as closed. Did
1.4005 fix the bug or not?

David

On Sun, 12 Nov 2006 03:29:04 +0100, Steinar H. Gunderson wrote:


> If I had to guess, I'd assume $self->{'email_number'} was somehow a
> _logical_ message number, and thus unfit for any sort of indexing.
> That's a guess, though.


That's right. CURRENT_EMAIL_INDEX (now CHUNK_INDEX) refers to an entry in
the grep data array that corresponds to some block of text in the file
that begins with "From ". Because this block of text may not be the
start of a new email, we may need to continue incrementing CHUNK_INDEX
and reading more chunks.


> Part of the reason seems to be that _adjust_cache_data() somehow merges
> or deletes messages without adjusting email_number; I'm not really sure
> what it is supposed to do.


At the end, after validating the start of the next email, I add up the
chunk entries to get the final, validated entry for the email.

As for not checking the result of read(), that was sloppy programming on
my part. I thought there was no way for the grep data and the file to get
out of sync, but apparently someone has found a way. :) I don't know what
the cause is in this case, but I'll try to detect and avoid/correct it in
the next release.

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

When the president does it that means that it is not illegal.
- Richard Nixon on domestic surveillance, 5/19/1977
Do I have the legal authority to do this? And the answer is, absolutely.
- George W. Bush on domestic surveillance, 12/19/2005





Bug#395268: hang in Mail::Mbox::MessageParser::Grep

2007-01-10 Thread David Coppit

On Tue, 9 Jan 2007, Joey Hess wrote:


> David, it seems that there's a bug in the Grep implementation of the
> MessageParser that can lead to a hang. See discussion at
> http://bugs.debian.org/395268


I've found and fixed the problem. The issue was that Tassilo's test case
assumed that read_next_email would return some false value, when in fact
you are not supposed to call the method if end_of_file is true. i.e. he
did:

  while(my $email = $folder_reader->read_next_email()) {
    print $output $$email;
  }

instead of:

  while(!$folder_reader->end_of_file()) {
    my $email = $folder_reader->read_next_email();
    print $output $$email;
  }

His way seems reasonable, so I added (back in?) support for
it---read_next_email now returns undef on EOF. I'll be releasing 1.5000
very soon.
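
To show that the two loop styles are equivalent once read_next_email
returns undef at EOF, here is a small self-contained sketch. ToyReader is a
hypothetical stand-in written for this example, not
Mail::Mbox::MessageParser. It only mimics the two methods discussed above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser object, for illustration only.
package ToyReader;

sub new {
    my ($class, @emails) = @_;
    return bless { emails => [@emails] }, $class;
}

sub end_of_file {
    my ($self) = @_;
    return @{ $self->{emails} } == 0;
}

# Returns a reference to the next email, or undef at end of file.
sub read_next_email {
    my ($self) = @_;
    return if $self->end_of_file;
    my $email = shift @{ $self->{emails} };
    return \$email;
}

package main;

# Style 1: rely on the return value being false at EOF.
sub collect_by_return_value {
    my ($reader) = @_;
    my @out;
    while (my $email = $reader->read_next_email()) {
        push @out, $$email;
    }
    return @out;
}

# Style 2: check end_of_file before each read.
sub collect_by_eof_check {
    my ($reader) = @_;
    my @out;
    while (!$reader->end_of_file()) {
        my $email = $reader->read_next_email();
        push @out, $$email;
    }
    return @out;
}
```

Style 1 is safe even for an empty email, because a reference is always true
in Perl. Only the undef returned at EOF ends the loop.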

David

P.S. Please CC me on bug reports as soon as my module is obviously
involved. I probably could have saved several people some debugging
effort. (I've thanked them all in my changelog.)

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

When the president does it that means that it is not illegal.
- Richard Nixon on domestic surveillance, 5/19/1977
Do I have the legal authority to do this? And the answer is, absolutely.
- George W. Bush on domestic surveillance, 12/19/2005





Bug#365151: libmail-mbox-messageparser-perl: message splitting breaks

2006-07-10 Thread David Coppit

On Wed, 21 Jun 2006, Volker Kuhlmann wrote:


> > There may be something to look for. I'll forward you an email that
> > describes a problem. I'm hoping someone can send me a sample mailbox.
>
> Your hopes can be upheld ;)
>
> Attached a sample mailbox, and debug output. It's my spam box (using
> grepmail is nifty to check for a false positive that has gone missing),
> so don't read the text too closely.
>
> The box contains 5 messages. Search string is @orcon.net.nz, and it
> occurs in msg 4 and 5, but all msgs from 2 are returned as match. (If
> the box was 1000 msgs longer, they would all be returned as well.)
>
> From my reading of the debug output, Mbox/MessageParser fails to
> recognise the "^from " in msg 1 as being part of the msg body. I can say
> with certainty that mutt has never failed me in a decade with separating
> mbox msgs. All my emails for the past 4 years have enforced correct
> content-length: headers; I don't care what XYZ or DJB says, it works
> fine. Mbox/MessageParser 1.20 hasn't failed me yet either.


Well, the mailbox is not valid. The reason appears to be that antispam.rc
has truncated the mailbox in an invalid way. Namely, the multipart
boundaries have been ignored, so that the ending for:

=_NextPart_000_0008_01C684B1.30F8FE30

is no longer there. From RFC 1341:

The encapsulation boundary following the last body part is a
distinguished delimiter that indicates that no further body parts will
follow. Such a delimiter is identical to the previous delimiters, with the
addition of two more hyphens at the end of the line

In previous versions this ill-formed mailbox was not seen because I was
not parsing multi-part emails correctly. In previous versions, if an email
was part of the main multi-part email, I would incorrectly break the
multi-part email. In this case you *want* me to break the email.

I assume that pine and mutt are doing my previous incorrect behavior. (I
just checked with pine, and it breaks the email even if I put the ending
boundary marker *after* the next email.)

What I'll try to do is this:

- Look for ending boundary
- If a "^From " appears before the ending boundary is found, ignore it and
  consider the email to be a part.
- If the ending boundary is not found, consider the mailbox to be
  ill-formed. Emit a warning, back up, and search for the next "^From ".
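
The steps above can be sketched roughly as follows. This is only an
illustration of the plan, not the code that will go into the module:
find_email_end is a made-up name, and it assumes the whole remaining
mailbox is already in memory and that the boundary string has been
extracted from the Content-Type header.

```perl
use strict;
use warnings;

# Given a buffer that starts at an email whose multipart boundary is
# $boundary, return the offset at which that email ends.
sub find_email_end {
    my ($buffer, $boundary) = @_;

    # The closing delimiter is "--" + boundary + "--" on a line of its own.
    my $closing = quotemeta('--' . $boundary . '--');
    my $search_from;

    if ($buffer =~ /^${closing}[ \t]*\r?$/mg) {
        # Well-formed: any "From " line before this point is inside a part.
        $search_from = pos($buffer);
    }
    else {
        warn "Ending boundary not found; mailbox looks ill-formed\n";
        # Back up: skip only this email's own "From " line.
        my $first_nl = index($buffer, "\n");
        $search_from = $first_nl < 0 ? length($buffer) : $first_nl + 1;
    }

    # Look for the start of the next email from the chosen position.
    pos($buffer) = $search_from;
    if ($buffer =~ /^From /mg) {
        return pos($buffer) - length('From ');
    }
    return length $buffer;    # no further email in the buffer
}
```

The failed first match is what scans the rest of the buffer, which is where
the cost for ill-formed mailboxes comes from.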

There's a nasty performance hit for ill-formed mailboxes as the parser
searches the rest of the file for the missing boundary, but perhaps that
will be an incentive for people to fix their mailboxes. :)

Eduard, as Joey noted, your mailbox has an invalid boundary as well. My
solution above should work for your case too. I'll work on this tonight
and email you all when it's fixed.

Regards,
David

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

Single sanction punishment doesn't work for presidents or cheaters.
http://www.coppit.org/blog/archives/119





Bug#365151: libmail-mbox-messageparser-perl: message splitting breaks

2006-05-21 Thread David Coppit

FYI,

Today I released Mail::Mbox::MessageParser version 1.4003, which
incorporates Eduard Bloch's patch for the "grepmail returning all emails"
bug.

Many thanks to all involved.

David

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

... frothy eloquence neither convinces nor satisfies me.
-- 1899, Willard D. Vandiver (D-MO)





Bug#365151: libmail-mbox-messageparser-perl: message splitting breaks

2006-05-21 Thread David Coppit

On Sun, 21 May 2006, Eduard Bloch wrote:


> It was not my patch, kudos to the creator (JoeyH, AFAICS).


Sorry. I was confused by the email he sent. I've updated the attribution
in the changelog for the next release.

David

_
David Coppit   [EMAIL PROTECTED]
The College of William and Mary   http://coppit.org/

... frothy eloquence neither convinces nor satisfies me.
-- 1899, Willard D. Vandiver (D-MO)

