Re: identify ISSN numbers in an mrc file

2016-11-02 Thread Patrick Hochstenbach
In Catmandu you can do this with the following fix script (it also keeps only
the valid ISSN numbers)…

# cpanm Catmandu Catmandu::MARC Catmandu::Identifier

$ cat myfix.fix
marc_map('***',text.$append)

filter(text,'(\b\d{4}-?\d{3}[\dxX]\b)')
replace_all(text.*,'.*(\b\d{4}-?\d{3}[\dxX]\b).*',$1)

do list(path:text)
  unless is_valid_issn(.)
    reject()
  end
end

vacuum()

select exists(text)

join_field(text,' ; ')

retain(_id,text)

$ catmandu convert MARC to CSV --fix myfix.fix < data.mrc
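
For anyone wondering what is_valid_issn adds on top of the regex: the eighth character of an ISSN is a check digit computed from the first seven digits (weights 8 down to 2, modulo 11, with X standing for 10). Below is a minimal Python sketch of that rule. It is only an illustration, not Catmandu's implementation, and the names ISSN_SHAPE and issn_is_valid are made up for this example.

```python
import re

# Same shape as the regex used in the fix, anchored to the whole string.
ISSN_SHAPE = re.compile(r"^(\d{4})-?(\d{3})([\dxX])$")


def issn_is_valid(candidate):
    """Return True if candidate has the ISSN shape and a correct check digit."""
    match = ISSN_SHAPE.match(candidate.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


if __name__ == "__main__":
    for value in ["0378-5955", "2434-561X", "1990-2000"]:
        print(value, issn_is_valid(value))
```

The year range 1990-2000 has the right shape but fails the check digit test, which is why the validity step is worth having after the regex.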

Patrick

> On 2 Nov 2016, at 11:29, Sergio Letuche wrote:
> 
> thank you very much
> 
> 2016-11-02 12:28 GMT+02:00 Ben Soares:
> Hi Sergio,
> 
> Try
> 
> ^\d{4}-\d{3}[\dxX]$
> 
> if you know that they will always be formatted with a hyphen in the middle, or
> 
> ^\d{4}-?\d{3}[\dxX]$
> 
> if you can't be sure of that.
> 
> (and if you're interested in spotting ISSNs in the middle of a field use
> \b\d{4}-?\d{3}[\dxX]\b
> but beware this also finds year ranges [e.g. 1990-2000]!)
> 
> Ben

Re: identify ISSN numbers in an mrc file

2016-11-02 Thread Sergio Letuche
thank you very much

2016-11-02 12:28 GMT+02:00 Ben Soares:

> Hi Sergio,
>
> Try
>
> ^\d{4}-\d{3}[\dxX]$
>
> if you know that they will always be formatted with a hyphen in the
> middle, or
>
> ^\d{4}-?\d{3}[\dxX]$
>
> if you can't be sure of that.
>
> (and if you're interested in spotting ISSNs in the middle of a field use
> \b\d{4}-?\d{3}[\dxX]\b
> but beware this also finds year ranges [e.g. 1990-2000]!)
>
> Ben

Re: identify ISSN numbers in an mrc file

2016-11-02 Thread Ben Soares
Hi Sergio,

Try

^\d{4}-\d{3}[\dxX]$

if you know that they will always be formatted with a hyphen in the middle, or

^\d{4}-?\d{3}[\dxX]$

if you can't be sure of that.

(and if you're interested in spotting ISSNs in the middle of a field use
\b\d{4}-?\d{3}[\dxX]\b
but beware this also finds year ranges [e.g. 1990-2000]!)
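
A small Python sketch, purely for illustration, of how the three patterns above behave. The constant and function names are invented for this example, and the sample strings are made up.

```python
import re

# The three patterns from this message, unchanged.
ANCHORED_HYPHEN = re.compile(r"^\d{4}-\d{3}[\dxX]$")
ANCHORED_OPTIONAL = re.compile(r"^\d{4}-?\d{3}[\dxX]$")
EMBEDDED = re.compile(r"\b\d{4}-?\d{3}[\dxX]\b")


def find_candidates(text):
    """Return every ISSN-shaped substring found anywhere in text."""
    return EMBEDDED.findall(text)


if __name__ == "__main__":
    # A whole-field match needs the hyphen only with the first pattern.
    print(bool(ANCHORED_HYPHEN.match("03785955")))    # False
    print(bool(ANCHORED_OPTIONAL.match("03785955")))  # True
    # The embedded pattern finds ISSNs inside longer text...
    print(find_candidates("ISSN 0378-5955 (print)"))  # ['0378-5955']
    # ...but it cannot tell a year range from an ISSN.
    print(find_candidates("Coverage: 1990-2000"))     # ['1990-2000']
```

So the embedded pattern is best treated as a first pass that produces candidates, which then need some further check.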

Ben


On Wednesday, 2 November 2016 12:06:15 GMT Sergio Letuche wrote:
> Thank you dear Stefano,
> 
> I am aware of this module, it works great.
> 
> But my problem is which regex to use in order to identify whether a
> subfield's content is an ISSN number. Say our mrc has ISSN numbers thrown
> into any tag you could imagine...
> 
> So my approach would be to search the whole mrc, but I do not know which
> regex to use...

Re: identify ISSN numbers in an mrc file

2016-11-02 Thread Sergio Letuche
Thank you dear Stefano,

I am aware of this module, it works great.

But my problem is which regex to use in order to identify whether a
subfield's content is an ISSN number. Say our mrc has ISSN numbers thrown
into any tag you could imagine...

So my approach would be to search the whole mrc, but I do not know which
regex to use...

2016-11-02 11:52 GMT+02:00 Stefano Bargioni:

> Hi, Sergio:
> you can try MARCgrep http://en.pusc.it/bib/MARCgrep.

Re: identify ISSN numbers in an mrc file

2016-11-02 Thread Stefano Bargioni
Hi, Sergio:
you can try MARCgrep http://en.pusc.it/bib/MARCgrep.
Its help is:

MARCgrep.pl
   Extracts MARC records that match a condition on fields. Count and
   invert are available.

SYNOPSIS
   MARCgrep.pl [options] [-e condition] file.mrc

Options:
  -h   print this help message and exit
  -c   count only
  -e   condition
  -f   comma separated list of fields to print
  -o   output format "marc" | "line" | "INLINE"
  -s   separator string for condition, default ","
  -v   invert match

Condition:
  -e  'tag,indicator1,indicator2,subfield,value'

OPTIONS
   -h  Print this message and exit.

   -c  Count and print number of matching records

   -e  The condition to match in the record.
       For data fields, the syntax is:

         tag,indicator1,indicator2,subfield,value

       where tag, indicator1, indicator2, subfield, and value are
       regular expressions patterns.
       Do not put spaces around the separators.

       For control fields, the syntax is:

         tag,pos1,pos2,value

       where tag starts with '00' (use '000' or 'LDR' for leader),
       pos1 is the starting position,
       pos2 is the ending position, both 0-based. Value is a regular
       expression.

       Default condition (-e not specified) matches any data field.
       For control fields, only the tag is mandatory.

       Examples: -e '100,,,a,^A' will match records that contain 100$a
                 starting with 'A'
                 -e '008,35,37,(ita|eng)' will match records with
                 language ita or eng in 008
                 -e '(1|7)(0|1)(0|1),,2' will match
                 100,110,111,700,710,711 with ind2=2

   -f  Comma separated list of fields (tags) to print if output format
       is "line" or "inline". Default is any field.
       Note that if a tag is preceded by '#' sign (like in '#nnn'), a
       count of occurrences will be printed instead.

       Examples: -f '100,245' will print field 100 and 245
                 -f '400,#400' will print all occurrences of 400 field
                 as well as the number of its occurrences

   -o  Output format: "marc" for ISO2709, "line" for each subfield in
       a line, "inline" (default) for each field in a line.

   -s  Specify a string separator for condition. Default is ','.

   -v  Invert the sense of matching, to select non-matching records.

   -V  Print the version and exit.

   file.mrc
   The mandatory ISO2709 file to read. Can be STDIN, '-'.

DESCRIPTION
   Like grep, the famous Unix utility, MARCgrep.pl allows to filter MARC
   bibliographic records based on conditions on tag, indicators, and
   field value.

   Conditions can be applied to data fields, control fields or the leader.

   In case of data fields, the condition can specify tag, indicators,
   subfield and value using regular expressions. In case of control
   fields, the condition must contain the tag name, the starting and
   ending position (both 0-based), and a regular expressions for the
   value.

   Options -c and -v allow respectively to count matching records and to
   invert the match.

   If option -c is not specified, the output format can be "line" or
   "inline" (both human readable), or "marc" for MARC binary (ISO2709).
   For formats "line" or "inline", the -f option allows to specify
   fields to print.

   You can chain more conditions using

   ./MARCgrep.pl -o marc -e condition1 file.mrc | ./MARCgrep.pl -e
   condition2 -

KNOWN ISSUES
   Performance.

   Accepts and returns only UTF-8.

   Checks are case sensitive.

AUTHOR
   Pontificia Universita' della Santa Croce 

   Stefano Bargioni 

SEE ALSO
   marktriggs / marcgrep at  for
   filtering large data sets
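
To make the data-field condition format from the help text concrete, here is a rough Python sketch of how a 'tag,indicator1,indicator2,subfield,value' condition could be applied to one field. This is not MARCgrep's code. The function names are hypothetical, and two behaviours are assumptions of mine that the help text does not spell out: empty parts match anything, and patterns are searched rather than anchored.

```python
import re


def parse_condition(condition, separator=","):
    """Split a data-field condition into five regex strings.

    Missing or empty parts become '' and are treated as "match anything"
    (an assumption made for this sketch).
    """
    parts = condition.split(separator, 4)
    parts += [""] * (5 - len(parts))
    return parts


def field_matches(condition, tag, ind1, ind2, subfields):
    """Check one data field; subfields is a list of (code, value) pairs."""
    tag_re, ind1_re, ind2_re, code_re, value_re = parse_condition(condition)
    if not (re.search(tag_re, tag)
            and re.search(ind1_re, ind1)
            and re.search(ind2_re, ind2)):
        return False
    # The field matches if at least one subfield satisfies code and value.
    return any(re.search(code_re, code) and re.search(value_re, value)
               for code, value in subfields)


if __name__ == "__main__":
    field = ("100", "1", " ", [("a", "Adams, Douglas")])
    print(field_matches("100,,,a,^A", *field))
    print(field_matches("(1|7)(0|1)(0|1),,2", *field))
```

With the sample field, the first condition matches (100$a starts with 'A') and the second does not, because the second indicator is blank rather than 2.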


> On 02 nov 2016, at 09:57, Sergio Letuche wrote:
> 
> Hello community,
> 
> how would you treat the following?
> 
> I need a way to identify all the tags and subfields that have an ISSN
> number stored in them.
> 
> What would you suggest as a clever approach for this?
> 
> Thank you



identify ISSN numbers in an mrc file

2016-11-02 Thread Sergio Letuche
Hello community,

how would you treat the following?

I need a way to identify all the tags and subfields that have an ISSN
number stored in them.

What would you suggest as a clever approach for this?

Thank you