Also, I'm curious about automatic detection of things like session IDs. Does anyone have some good ideas about that?

Hi,

I once implemented basic session ID detection in Perl. It only checks whether a link contains a session ID, based on three observations:

1. Session IDs often have special parameter names (for example "sess=" or "SID=").
2. Session IDs often have 10 or more characters and switch frequently between digits and letters.
3. Session IDs often have 10 or more characters and switch frequently between upper-case and lower-case letters.


This works very well in our event crawler. I have appended the code below.

Maybe this helps.

Bye

Matthias

--

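# Returns 1 if the link looks like it contains a session ID, 0 otherwise.
sub check_sessionid {
  my $link = shift || return;
  # Heuristic 1: well-known parameter names.
  return 1 if $link =~ /sess=/;
  return 1 if $link =~ /SID=/;
  # Only look at URL parts with 10 or more word characters.
  my @bigparts = grep (/\w{10,}/, split(/\W/, $link));
  foreach my $part (@bigparts) {
    # Heuristic 2: many switches between letters and digits.
    if ( (Master::Amount("[a-zA-Z]\\d", $part) > 5 ) && (Master::Amount("\\d[a-zA-Z]", $part) > 5 ) ) {
      return 1;
    }
    # Heuristic 3: many switches between lower case and upper case.
    if ( (Master::Amount2("[a-z][A-Z]", $part) > 5 ) && (Master::Amount2("[A-Z][a-z]", $part) > 5 ) ) {
      return 1;
    }
  }
  return 0;
}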



package Master;

sub Amount {
  my ($suche,$in) = @_;
  my $count=0;
  while ($in =~ /$suche/gi){ $count++; }
  return $count;
}

sub Amount2 {
  my ($suche,$in) = @_;
  my $count=0;
  while ($in =~ /$suche/g){ $count++; }
  return $count;
}
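
To try the heuristics quickly, here is a minimal self-contained sketch. It is a condensed variant of check_sessionid above, not the crawler's actual code: the names count_matches and looks_like_session_id and the example.com URLs are made up for illustration, and the thresholds are simply copied from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count non-overlapping matches of a pattern in a string
# (hypothetical helper, plays the role of Master::Amount2).
sub count_matches {
    my ($pattern, $text) = @_;
    my $count = () = $text =~ /$pattern/g;
    return $count;
}

# Hypothetical condensed form of check_sessionid.
sub looks_like_session_id {
    my $link = shift;
    return 0 unless defined $link;

    # Heuristic 1: well-known parameter names.
    return 1 if $link =~ /sess=|SID=/;

    # Heuristics 2 and 3: long parts with many character-class switches.
    for my $part (grep { /\w{10,}/ } split /\W/, $link) {
        return 1 if count_matches('[a-zA-Z]\d', $part) > 5
                 && count_matches('\d[a-zA-Z]', $part) > 5;
        return 1 if count_matches('[a-z][A-Z]', $part) > 5
                 && count_matches('[A-Z][a-z]', $part) > 5;
    }
    return 0;
}

# Made-up example links: a plain page, a named parameter, a mixed token.
my @links = (
    'http://example.com/events/list.html',
    'http://example.com/view?SID=42',
    'http://example.com/view?id=a1b2c3d4e5f6g7h8',
);
printf "%d %s\n", looks_like_session_id($_), $_ for @links;
```

The first link is reported as 0 and the other two as 1. Note that a long token made up only of digits, or only of lower-case letters, is not flagged by these checks, so the thresholds may need tuning for other sites.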

