Hi Paul,

Yes, the CSV indexer removes the existing CSV output before it starts a
new one (see the warning "Removing existing output path
csvindexwriter/nutch.csv" in your log).
The problem here is that the indexer is run twice in a loop.

Possible work-arounds, assuming you're using the script bin/crawl:

1 after each indexing command in the loop, move the CSV output so that
  it is not deleted later:

  mv nutch.csv nutch-$(date +%Y%m%d%H%M%S).csv

2 run the index step after the loop. Instead of passing a single segment,
  you need to index all segments in the segments/ folder. Just replace
    .../segments/$SEGMENT
  with
    -dir .../segments/
  Work-around 2 has the advantage that the index is a single file.
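
A minimal sketch of work-around 1 as shell functions, in case it helps.
The names archive_csv, merge_csv and CSV_DIR are made up for illustration
and are not part of bin/crawl. The sketch assumes the default outpath
(csvindexwriter) and that the header option is left at its default (true):

```shell
# Sketch only: CSV_DIR and both function names are assumptions for this
# example, they do not exist in bin/crawl.
CSV_DIR="${CSV_DIR:-csvindexwriter}"   # default outpath of the CSV indexer

# Move nutch.csv aside under a timestamped name so that the next
# indexing run cannot delete it. Prints the new file name.
archive_csv() {
  local src="$CSV_DIR/nutch.csv"
  [ -f "$src" ] || return 0            # nothing was written in this round
  local dst="$CSV_DIR/nutch-$(date +%Y%m%d%H%M%S).csv"
  local n=0
  # Two rounds may finish within the same second: add a counter then.
  while [ -e "$dst" ]; do
    n=$((n + 1))
    dst="$CSV_DIR/nutch-$(date +%Y%m%d%H%M%S)-$n.csv"
  done
  mv "$src" "$dst"
  echo "$dst"
}

# Merge all archived files into the file given as $1, keeping the
# header line only once. Assumes every file starts with a header line.
merge_csv() {
  local out="$1" first=1 f
  : > "$out"
  for f in "$CSV_DIR"/nutch-*.csv; do
    [ -f "$f" ] || continue
    if [ "$first" -eq 1 ]; then
      cat "$f" >> "$out"               # first file: keep header and rows
      first=0
    else
      tail -n +2 "$f" >> "$out"        # later files: skip the header line
    fi
  done
}
```

You would call archive_csv right after the index command inside the loop,
and merge_csv once after the loop, with a target file whose name does not
match nutch-*.csv (e.g. merge_csv all.csv).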


For the long term we might add an option to include a unique component
in the name of the CSV output file (e.g. a timestamp). Or add work-around 2
to the crawl script. Let us know if you need such a solution for the
development branch.

A final note: the CSV indexer only works in local mode; it does not yet
work in distributed mode (on a real Hadoop cluster). It was originally
intended for debugging, not for larger production setups.

Best,
Sebastian


On 11/18/22 15:16, Paul Escobar wrote:
I'm using the CSV indexer to write Nutch data, but in the nutch.csv file I
find only the last thirteen lines. It seems the indexer is overwriting the
file. I've read the Nutch CSV indexer documentation but I haven't found any
configuration related to this situation. Could someone help me get all
the lines extracted by the parser? This is the log output and the
index-writers.xml configuration:


org.apache.nutch.plugin.PluginManifestParser 2022-11-18 07:48:02,323 INFO
o.a.n.p.PluginManifestParser [main] Plugins: looking in:
/home/paulesco/Downloads/apache-nutch-1.19/plugins
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,753 INFO
o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true]
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,754 INFO
o.a.n.p.PluginRepository [main] Registered Plugins:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] HTTP Framework (lib-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO
o.a.n.p.PluginRepository [main] the nutch core extension points
(nutch-extensionpoints)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO
o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO
o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO
o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO
o.a.n.p.PluginRepository [main] Extractor based XML/HTML Parser/Indexing
Filter (extractor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO
o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,759 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter Framework
(lib-regex-filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO
o.a.n.p.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO
o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO
o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO
o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO
o.a.n.p.PluginRepository [main] Pass-through URL Normalizer
(urlnormalizer-pass)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO
o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO
o.a.n.p.PluginRepository [main] CSVIndexWriter (indexer-csv)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO
o.a.n.p.PluginRepository [main] Registered Extension-Points:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO
o.a.n.p.PluginRepository [main] (Nutch Content Parser)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO
o.a.n.p.PluginRepository [main] (HTML Parse Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO
o.a.n.p.PluginRepository [main] (Nutch Scoring)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,766 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Normalizer)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,766 INFO
o.a.n.p.PluginRepository [main] (Nutch Publisher)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,767 INFO
o.a.n.p.PluginRepository [main] (Nutch Exchange)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,767 INFO
o.a.n.p.PluginRepository [main] (Nutch Protocol)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,768 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,768 INFO
o.a.n.p.PluginRepository [main] (Nutch Index Writer)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,769 INFO
o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,769 INFO
o.a.n.p.PluginRepository [main] (Nutch Indexing Filter)
org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:02,778 INFO
o.a.n.c.DeduplicationJob [main] DeduplicationJob: starting at 2022-11-18
07:48:02
org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:05,628 INFO
o.a.n.c.DeduplicationJob [main] Deduplication: 0 documents marked as
duplicates
org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:05,629 INFO
o.a.n.c.DeduplicationJob [main] Deduplication: Updating status of duplicate
urls into crawl db.
org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:06,996 INFO
o.a.n.c.DeduplicationJob [main] Deduplication finished at 2022-11-18
07:48:06, elapsed: 00:00:04
Indexing 20221118074241 to index
/home/paulesco/Downloads/apache-nutch-1.19/bin/nutch index
-Dmapreduce.job.reduces=2 -Dmapreduce.reduce.speculative=false
-Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true
/home/paulesco/Downloads/apache-nutch-1.19/crawl/crawldb -linkdb
/home/paulesco/Downloads/apache-nutch-1.19/crawl/linkdb
/home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241
-deleteGone
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/paulesco/Downloads/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/paulesco/Downloads/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]
org.apache.nutch.plugin.PluginManifestParser 2022-11-18 07:48:09,623 INFO
o.a.n.p.PluginManifestParser [main] Plugins: looking in:
/home/paulesco/Downloads/apache-nutch-1.19/plugins
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,111 INFO
o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true]
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,113 INFO
o.a.n.p.PluginRepository [main] Registered Plugins:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,114 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,114 INFO
o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,115 INFO
o.a.n.p.PluginRepository [main] HTTP Framework (lib-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,115 INFO
o.a.n.p.PluginRepository [main] the nutch core extension points
(nutch-extensionpoints)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,116 INFO
o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,116 INFO
o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,117 INFO
o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,118 INFO
o.a.n.p.PluginRepository [main] Extractor based XML/HTML Parser/Indexing
Filter (extractor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,118 INFO
o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,119 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter Framework
(lib-regex-filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,119 INFO
o.a.n.p.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,120 INFO
o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,120 INFO
o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,121 INFO
o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,122 INFO
o.a.n.p.PluginRepository [main] Pass-through URL Normalizer
(urlnormalizer-pass)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,122 INFO
o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,123 INFO
o.a.n.p.PluginRepository [main] CSVIndexWriter (indexer-csv)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,123 INFO
o.a.n.p.PluginRepository [main] Registered Extension-Points:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,124 INFO
o.a.n.p.PluginRepository [main] (Nutch Content Parser)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,124 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,125 INFO
o.a.n.p.PluginRepository [main] (HTML Parse Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,125 INFO
o.a.n.p.PluginRepository [main] (Nutch Scoring)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,126 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Normalizer)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,126 INFO
o.a.n.p.PluginRepository [main] (Nutch Publisher)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,127 INFO
o.a.n.p.PluginRepository [main] (Nutch Exchange)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,127 INFO
o.a.n.p.PluginRepository [main] (Nutch Protocol)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,128 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,128 INFO
o.a.n.p.PluginRepository [main] (Nutch Index Writer)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,129 INFO
o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,129 INFO
o.a.n.p.PluginRepository [main] (Nutch Indexing Filter)
org.apache.nutch.segment.SegmentChecker 2022-11-18 07:48:10,617 INFO
o.a.n.s.SegmentChecker [main] Segment dir is complete:
/home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241.
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,620 INFO
o.a.n.i.IndexingJob [main] Indexer: starting at 2022-11-18 07:48:10
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,634 INFO
o.a.n.i.IndexingJob [main] Indexer: deleting gone documents: true
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,634 INFO
o.a.n.i.IndexingJob [main] Indexer: URL filtering: false
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,635 INFO
o.a.n.i.IndexingJob [main] Indexer: URL normalizing: false
org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,637 INFO
o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: crawldb:
/home/paulesco/Downloads/apache-nutch-1.19/crawl/crawldb
org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,642 INFO
o.a.n.i.IndexerMapReduce [main] IndexerMapReduces: adding segment:
/home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241
org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,644 INFO
o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: linkdb:
/home/paulesco/Downloads/apache-nutch-1.19/crawl/linkdb
org.apache.nutch.indexer.IndexWriters 2022-11-18 07:48:13,788 INFO
o.a.n.i.IndexWriters [pool-5-thread-1] Index writer
org.apache.nutch.indexwriter.csv.CSVIndexWriter identified.
org.apache.nutch.exchange.Exchanges 2022-11-18 07:48:13,845 WARN
o.a.n.e.Exchanges [pool-5-thread-1] No exchange was configured. The
documents will be routed to all index writers.
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,848 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] separator = ,
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,880 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator
quotechar must be a char, only the first character '"' of """ is used
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,880 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] quotechar = "
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,881 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator
escapechar must be a char, only the first character '"' of """ is used
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,881 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] escapechar = "
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:13,882 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] valuesep = |
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,883
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldlength = 8096
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,884
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldvalues = 120
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,885
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] fields =
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,886
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] id
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,887
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] company
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,887
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] date
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobTitle
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobDescription
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] location
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,889
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] json
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,890
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Writing output to
csvindexwriter
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,891
WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Removing existing output
path csvindexwriter/nutch.csv
org.apache.nutch.indexer.IndexerOutputFormat 2022-11-18 07:48:14,059 INFO
o.a.n.i.IndexerOutputFormat [pool-5-thread-1] Active IndexWriters :
CSVIndexWriter:
┌──────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┐
│fields        │Ordered list of fields (columns) in the CSV file     │id,company,date,jobTitle,jobDescription,location,json│
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│separator     │Separator  between  fields  (columns),   default:   ,│,                                                    │
│              │(U+002C, comma)                                      │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│quotechar     │Quote  character  used  to  quote  fields  containing│"                                                    │
│              │separators or quotes, default: "  (U+0022,  quotation│                                                     │
│              │mark)                                                │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│escapechar    │Escape character used to escape  a  quote  character,│"                                                    │
│              │default: " (U+0022, quotation mark)                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│valuesep      │Separator  between  multiple  values  of  one  field,│|                                                    │
│              │default: | (U+007C)                                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│maxfieldvalues│Max. number of values of one field, useful for, e.g.,│120                                                  │
│              │the anchor texts field, default: 12                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│maxfieldlength│Max. length of a single field  value  in  characters,│8096                                                 │
│              │default: 4096                                        │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│charset       │Encoding of CSV file, default: UTF-8                 │UTF-8                                                │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│header        │Write CSV column headers, default: true              │true                                                 │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│outpath       │Output path / directory, default: csvindexwriter.    │csvindexwriter                                       │
└──────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘


org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2022-11-18
07:48:14,079 INFO o.a.n.i.a.AnchorIndexingFilter [pool-5-thread-1] Anchor
deduplication is: off
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by
com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1
(file:/home/paulesco/Downloads/apache-nutch-1.19/lib/jaxb-impl-2.2.3-1.jar)
to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int)
WARNING: Please consider reporting this to the maintainers of
com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,875 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/administration-assistant-at-apple-3358665327?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=hPPT6HwfoeW5O5x3hD19Og%3D%3D&position=15&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,891 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/business-development-music-content-at-apple-3303474256?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=WixmspxoAN5LwMiK85fGTQ%3D%3D&position=13&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,894 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/business-marketing-and-g-a-internships-at-apple-3109770600?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=76Rvg5XTnq%2BMLXkyvInKEw%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,898 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/engineering-program-management-internship-at-apple-3178528752?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=AkNO4ulHoq2VdFGV8zrX7Q%3D%3D&position=14&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,900 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/executive-administrative-assistant-at-apple-3178549204?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=0tgIj1%2F3UsEYVTatO5k8AQ%3D%3D&position=5&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,905 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/full-stack-web-developer-early-career-at-apple-3178543696?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=ASc%2FwLZwb%2BWxgCMD98xZjA%3D%3D&position=10&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,908 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/instructional-designer-and-facilitator-at-apple-3311380419?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=8jWxwc90ubxidsR7yCUa8g%3D%3D&position=23&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,912 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/marketing-specialist-payments-at-apple-3295802145?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=moSai8myEFTiBHfy86ZdfQ%3D%3D&position=12&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,916 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/partner-relationship-manager-at-apple-3335905674?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=yQNQPxWYOe5pA2zSupCXhw%3D%3D&position=11&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,918 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3083602420?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=syVQzNeq4uvv%2BV%2FnE5pMjw%3D%3D&position=9&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,921 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3142389594?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=LtuRytaw2JrWIPBarIZPRA%3D%3D&position=8&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:14,924 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3165763449?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=d3A78tGewvInBwuE1TY97A%3D%3D&position=4&pageNum=0&trk=public_jobs_jserp-result_search-card
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:14,930
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Finished CSV index in
csvindexwriter/nutch.csv
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,071 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] separator = ,
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,072 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator
quotechar must be a char, only the first character '"' of """ is used
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,072 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] quotechar = "
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,073 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator
escapechar must be a char, only the first character '"' of """ is used
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,073 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] escapechar = "
org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18
07:48:15,074 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] valuesep = |
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,074
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldlength = 8096
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,074
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldvalues = 120
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,075
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] fields =
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,075
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] id
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,076
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] company
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,076
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] date
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobTitle
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobDescription
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] location
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,078
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] json
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,079
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Writing output to
csvindexwriter
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,080
WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Removing existing output
path csvindexwriter/nutch.csv
org.apache.nutch.indexer.IndexerOutputFormat 2022-11-18 07:48:15,117 INFO
o.a.n.i.IndexerOutputFormat [pool-5-thread-1] Active IndexWriters :
CSVIndexWriter:
┌──────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┐
│fields        │Ordered list of fields (columns) in the CSV file     │id,company,date,jobTitle,jobDescription,location,json│
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│separator     │Separator  between  fields  (columns),   default:   ,│,                                                    │
│              │(U+002C, comma)                                      │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│quotechar     │Quote  character  used  to  quote  fields  containing│"                                                    │
│              │separators or quotes, default: "  (U+0022,  quotation│                                                     │
│              │mark)                                                │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│escapechar    │Escape character used to escape  a  quote  character,│"                                                    │
│              │default: " (U+0022, quotation mark)                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│valuesep      │Separator  between  multiple  values  of  one  field,│|                                                    │
│              │default: | (U+007C)                                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│maxfieldvalues│Max. number of values of one field, useful for, e.g.,│120                                                  │
│              │the anchor texts field, default: 12                  │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│maxfieldlength│Max. length of a single field  value  in  characters,│8096                                                 │
│              │default: 4096                                        │                                                     │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│charset       │Encoding of CSV file, default: UTF-8                 │UTF-8                                                │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│header        │Write CSV column headers, default: true              │true                                                 │
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤
│outpath       │Output path / directory, default: csvindexwriter.    │csvindexwriter                                       │
└──────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘


ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,154 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/content-strategist-at-apple-3183050156?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=3n3SZTr2DDL%2BuLJG80tF5A%3D%3D&position=17&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,158 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/corporate-fp-a-financial-analyst-at-apple-3299573611?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=v9%2F3SUQVjBpc7kyqFpz%2BGw%3D%3D&position=16&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,160 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/customer-support-account-representative-at-apple-3276378529?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=mcqQ08GV2r%2BhQGjrKUBV3g%3D%3D&position=24&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,164 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/executive-assistant-at-apple-3343515422?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=6GofJN8fsMPysOPQF4p%2FVA%3D%3D&position=25&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,168 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/global-supply-manager-at-apple-3122122362?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=6gEcpGvSLAZQDo0J6CEP5w%3D%3D&position=18&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,171 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/instructional-designer-and-facilitator-at-apple-3320714845?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=2LtFgvgbFnFky52wmV6%2BVw%3D%3D&position=22&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,173 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/instructional-designer-at-apple-3299571683?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=1O2wuFrYl7seVDay0vY9Dg%3D%3D&position=21&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,175 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/jr-software-developer-c-c%2B%2B-at-apple-2995935448?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=OoO8lg0lxNY3lZsoKICCJQ%3D%3D&position=20&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,178 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/partner-success-manager-at-apple-3238337934?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=jkjzk0WHT79R40TGmVOTsA%3D%3D&position=3&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,181 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/people-operations-hris-analyst-at-apple-3217837096?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=Gusmq8ZxlihLpNTzAXfPdg%3D%3D&position=19&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,184 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/people-support-specialist-at-apple-3296942621?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=tdx1V7OXKAuLLt76scpuaQ%3D%3D&position=7&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,187 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-2944352450?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=91p8jFJwx2KAh6bwE%2Bsv2Q%3D%3D&position=6&pageNum=0&trk=public_jobs_jserp-result_search-card
ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18
07:48:15,190 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1]
Indexing:
https://www.linkedin.com/jobs/view/software-engineering-internship-at-apple-3109778916?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=U0qyMZ4ai%2FquB19uZyoEKQ%3D%3D&position=2&pageNum=0&trk=public_jobs_jserp-result_search-card
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,197
INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Finished CSV index in
csvindexwriter/nutch.csv
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:15,983 INFO
o.a.n.i.IndexingJob [main] Indexer: number of documents indexed, deleted,
or skipped:
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:15,999 INFO
o.a.n.i.IndexingJob [main] Indexer:     25  indexed (add/update)
org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:16,005 INFO
o.a.n.i.IndexingJob [main] Indexer: finished at 2022-11-18 07:48:15,
elapsed: 00:00:05
vie nov 18 07:48:16 -05 2022 : Finished loop with 2 iterations
-----------------------------------------------------------------------------------------------------------
index-writers.xml:

<writer id="indexer_csv_1" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
  <parameters>
    <!-- <param name="fields" value="id,title,content"/> -->
    <param name="fields" value="id,company,date,jobTitle,jobDescription,location,json"/>
    <param name="charset" value="UTF-8"/>
    <param name="separator" value=","/>
    <param name="valuesep" value="|"/>
    <param name="quotechar" value="&quot;"/>
    <param name="escapechar" value="&quot;"/>
    <param name="maxfieldlength" value="8096"/>
    <param name="maxfieldvalues" value="120"/>
    <param name="header" value="true"/>
    <param name="outpath" value="csvindexwriter"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>
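
With this configuration every indexing run writes to csvindexwriter/nutch.csv,
and the CSV indexer removes the existing file before it starts a new one. A
minimal shell sketch of keeping each run's output by renaming it with a
timestamp is shown here. The function name preserve_csv is hypothetical (it is
not part of Nutch or bin/crawl), and the directory and file name are assumed
from the outpath parameter and the log output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Nutch or bin/crawl: rename the CSV written
# by one indexing run so that the next run does not delete it.
# Assumes the output file is <outdir>/nutch.csv, as reported in the log.
preserve_csv() {
  local outdir="${1:-csvindexwriter}"
  local src="$outdir/nutch.csv"
  local stamp dest n

  # Nothing to do if the indexer has not written a file.
  [ -f "$src" ] || return 0

  stamp=$(date +%Y%m%d%H%M%S)
  dest="$outdir/nutch-$stamp.csv"

  # Add a counter if two runs finish within the same second.
  n=1
  while [ -e "$dest" ]; do
    dest="$outdir/nutch-$stamp-$n.csv"
    n=$((n + 1))
  done

  mv "$src" "$dest"
  # Print the new file name so the caller can log it.
  echo "$dest"
}
```

It would be called once after each indexing command inside the loop, for
example: preserve_csv csvindexwriter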

I haven't mentioned it yet, but I'm using the Bayan Group extractor plugin to
extract some specific fields from LinkedIn job posts.

Thanks,



