[
https://issues.apache.org/jira/browse/NIFI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868856#comment-15868856
]
ASF GitHub Bot commented on NIFI-2613:
--------------------------------------
Github user jvwing commented on a diff in the pull request:
https://github.com/apache/nifi/pull/929#discussion_r101393181
--- Diff: nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java ---
@@ -0,0 +1,393 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.processors.poi;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicReference;
+
+import org.apache.commons.io.FilenameUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.annotation.behavior.WritesAttribute;
+import org.apache.nifi.annotation.behavior.WritesAttributes;
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.Tags;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.flowfile.FlowFile;
+import org.apache.nifi.flowfile.attributes.CoreAttributes;
+import org.apache.nifi.processor.AbstractProcessor;
+import org.apache.nifi.processor.ProcessContext;
+import org.apache.nifi.processor.ProcessSession;
+import org.apache.nifi.processor.ProcessorInitializationContext;
+import org.apache.nifi.processor.Relationship;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.nifi.processor.io.StreamCallback;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
+import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
+import org.apache.poi.openxml4j.opc.OPCPackage;
+import org.apache.poi.xssf.eventusermodel.XSSFReader;
+import org.apache.poi.xssf.model.SharedStringsTable;
+import org.apache.poi.xssf.usermodel.XSSFRichTextString;
+import org.xml.sax.Attributes;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+import org.xml.sax.helpers.DefaultHandler;
+import org.xml.sax.helpers.XMLReaderFactory;
+
+
+@Tags({"excel", "csv", "poi"})
+@CapabilityDescription("Consumes a Microsoft Excel document and converts each worksheet to csv. Each sheet from the incoming Excel " +
+ "document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " +
+ "where the each row from the excel sheet is output as a newline in the csv file.")
+@WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"),
+ @WritesAttribute(attribute="numrows", description="The number of rows in this Excel Sheet"),
--- End diff --
Can we clarify whether this is the number of rows in the input spreadsheet
or the number of output rows in the flowfile?
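To illustrate why the two counts can differ, here is a minimal, dependency-free sketch. It is not the processor's code: the class RowCountSketch, its convert method, and the rule that blank rows are skipped are all assumptions made for illustration, and the PR may handle blank rows differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the PR.
public class RowCountSketch {

    /** Holds the CSV text plus both candidate meanings of "numrows". */
    static final class Result {
        final String csv;
        final int inputRows;   // rows present in the source sheet
        final int outputRows;  // lines actually written to the CSV

        Result(String csv, int inputRows, int outputRows) {
            this.csv = csv;
            this.inputRows = inputRows;
            this.outputRows = outputRows;
        }
    }

    /**
     * Converts in-memory rows to CSV text.
     * ASSUMPTION: rows whose cells are all null or empty are skipped.
     * No quoting or escaping is done, so this is not a general CSV writer.
     */
    static Result convert(List<List<String>> sheetRows) {
        StringBuilder csv = new StringBuilder();
        int outputRows = 0;
        for (List<String> row : sheetRows) {
            boolean blank = true;
            for (String cell : row) {
                if (cell != null && !cell.isEmpty()) {
                    blank = false;
                    break;
                }
            }
            if (blank) {
                continue; // skipped row: counted as input, not as output
            }
            csv.append(String.join(",", row)).append('\n');
            outputRows++;
        }
        return new Result(csv.toString(), sheetRows.size(), outputRows);
    }

    public static void main(String[] args) {
        List<List<String>> sheet = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("", ""),
                Arrays.asList("1", "alice"));
        Result r = convert(sheet);
        System.out.println("input rows  = " + r.inputRows);
        System.out.println("output rows = " + r.outputRows);
    }
}
```

Under that assumption the sheet has three input rows but only two output rows, so the attribute description would need to say which of the two "numrows" reports.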
> Support extracting content from Microsoft Excel (.xlxs) documents
> -----------------------------------------------------------------
>
> Key: NIFI-2613
> URL: https://issues.apache.org/jira/browse/NIFI-2613
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Reporter: Jeremy Dyer
> Assignee: Jeremy Dyer
>
> Microsoft Excel is a wildly popular application that businesses rely heavily
> on to store, visualize, and calculate data. Any single company most likely
> has thousands of Excel documents containing data that could be very valuable
> if ingested via NiFi and combined with other datasources. Apache POI is a
> popular 100% Java library for parsing several Microsoft document formats
> including Excel. Apache POI is extremely flexible and can do several things.
> This issue would focus solely on using Apache POI to parse an incoming .xlsx
> document and convert it to CSV. The processor should be capable of limiting
> which Excel sheets are converted. CSV seems like the natural choice for outputting each row
> since this feature is already available in Excel and feels very natural to
> most Excel sheet designs.
> This capability should most likely introduce a new "poi" module as I envision
> many more capabilities around parsing Microsoft documents could come from
> this base effort.