[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837378#comment-17837378
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566194105


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) 
+returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
   A string is needed so that people can add fetchers dynamically. Validation 
will make sure the class exists and will return a helpful error message if it 
does not.
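
The class-existence check described here could be sketched roughly as follows. This is only an illustration, not the code in the PR: `FetcherClassValidator` and its error messages are hypothetical names, and a real check would presumably also confirm that the class implements Tika's fetcher interface.

```java
import java.util.Optional;

/** Hypothetical helper: checks that a fetcher class name resolves on the classpath. */
final class FetcherClassValidator {
    private FetcherClassValidator() {}

    /**
     * Returns an empty Optional when the class can be loaded, otherwise a
     * human-readable message that could be used as the RPC error description.
     */
    static Optional<String> validate(String fetcherClass) {
        if (fetcherClass == null || fetcherClass.trim().isEmpty()) {
            return Optional.of("fetcher_class must not be empty");
        }
        try {
            // initialize=false: resolve the class without running its static initializers.
            Class.forName(fetcherClass, false, FetcherClassValidator.class.getClassLoader());
            return Optional.empty();
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.of("Unknown fetcher class: " + fetcherClass);
        }
    }
}
```

Keeping `fetcher_class` a plain string and validating it on the server lets custom fetchers outside the Tika project be registered without changing the proto.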





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837377#comment-17837377
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566193576


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
   Added not-found detection.







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832819#comment-17832819
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546235205


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) 
+returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
   Should this be a protobuf enum containing the constrained set of classes? Or 
does Tika need to support arbitrary strings here in case of custom fetchers not 
included in the Tika project?
   







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832817#comment-17832817
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546232766


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}

Review Comment:
   Could we document these RPCs so that their high-level behaviour is clear? 
For example, if I try to create a fetcher which already exists, what is the 
expected reply? Is it an error response on the RPC, or will CreateFetcherReply 
carry error-identifying information?
   
   If CreateFetcher were made idempotent, could we collapse these into a single 
RPC (UpdateFetcher) which either creates, updates, or no-ops (no changes 
despite the call) the Fetcher?
   
   I don't want to overcomplicate the Tika side, of course, but I am curious 
whether we can improve the client interface.
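
To make the collapsed-RPC idea concrete, here is a minimal sketch of upsert semantics over an in-memory map. It is an illustration only: `FetcherStore`, `Outcome`, and `upsert` are hypothetical names, not part of the PR, and a real fetcher would carry configuration beyond a class name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store illustrating an idempotent create-or-update call. */
final class FetcherStore {
    /** What a single upsert call did, so a reply could report it to the client. */
    enum Outcome { CREATED, UPDATED, UNCHANGED }

    // Maps fetcher name to fetcher class name.
    private final Map<String, String> classByName = new ConcurrentHashMap<>();

    Outcome upsert(String name, String fetcherClass) {
        // put() returns the previous value, or null if the name was not present.
        String previous = classByName.put(name, fetcherClass);
        if (previous == null) {
            return Outcome.CREATED;
        }
        return previous.equals(fetcherClass) ? Outcome.UNCHANGED : Outcome.UPDATED;
    }
}
```

Repeating the same call is harmless under this design, which is the property that would allow a single RPC to replace separate create and update calls.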



##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
   Similar to above regarding documentation, it would be great to understand 
what happens if I try to get a fetcher which does not exist. Is there a 
distinct error, or do I simply get an empty GetFetcherReply?
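
One way to give a missing fetcher a distinct error rather than an empty reply is sketched below. The names `FetcherLookup` and `FetcherNotFoundException` are hypothetical; in a gRPC server, such an exception would typically be translated into the standard NOT_FOUND status code, which is an assumption about the design rather than a description of the PR.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lookup that distinguishes "not found" from a normal result. */
final class FetcherLookup {
    /** Hypothetical error type signalling that no fetcher has the requested name. */
    static final class FetcherNotFoundException extends RuntimeException {
        FetcherNotFoundException(String name) {
            super("No fetcher registered under name: " + name);
        }
    }

    // Maps fetcher name to fetcher class name.
    private final Map<String, String> classByName = new ConcurrentHashMap<>();

    void register(String name, String fetcherClass) {
        classByName.put(name, fetcherClass);
    }

    /** Returns the fetcher class for the name, or an empty Optional if absent. */
    Optional<String> find(String name) {
        return Optional.ofNullable(classByName.get(name));
    }

    /** Returns the fetcher class for the name, or throws the distinct error. */
    String getOrThrow(String name) {
        return find(name).orElseThrow(() -> new FetcherNotFoundException(name));
    }
}
```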







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832676#comment-17832676
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858447


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
   Yes, we are using `name` as the ID. @tballison, any thoughts here? Maybe we 
should rename that for 3.x.
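
Treating `name` as the ID implies that names must be unique while the fetcher class may repeat. A minimal sketch of that rule, using the hypothetical names `NamedFetchers` and `create` (not taken from the PR):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry in which the fetcher name acts as the unique ID. */
final class NamedFetchers {
    // Maps fetcher name (the ID) to fetcher class name.
    private final Map<String, String> classByName = new ConcurrentHashMap<>();

    /** Returns true if the fetcher was registered, false if the name is already taken. */
    boolean create(String name, String fetcherClass) {
        // putIfAbsent() leaves an existing entry untouched and returns it.
        return classByName.putIfAbsent(name, fetcherClass) == null;
    }

    int size() {
        return classByName.size();
    }
}
```

Two fetchers of the same class can coexist as long as each is created under its own name.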







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832675#comment-17832675
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858324


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   I ran the normal proto linter. I'm going to leave the other findings as they 
are: the buf extension checks didn't seem to add much value for my context and 
would have added hours of work.







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832568#comment-17832568
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545596130


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
   Must `name` be unique across all initialized fetchers? To me, `name` implies 
a descriptive label. Is this more of an ID?
   
   The use case I am thinking of is creating multiple fetchers with the same 
class. Right now I would create a unique name for each one. Is that the correct 
expectation?







[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I 
am syncing it to a local repository for development purposes) and here's the 
report:
   
   ```
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
   services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
   services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
   ```

[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-02-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814829#comment-17814829
 ] 

Tim Allison commented on TIKA-4181:
---

We recently had a request for something like what's in the diagram for 
tika-server: submit a fetch request, run the parse in a forked process 
(tika-pipes), but then instead of the emitter shipping off the results, the 
results are returned to the caller.  I _think_ this is what you describe in the 
diagram.

There is room in returned PipesResult for the full emitData. We need to modify 
the pipesClient to skip the usual emitting and return the full results of the 
parse for that request -- I think we can do that now by setting 
maxForEmitBatchBytes to a value < 0, but we should have a more elegant way of 
doing this.

WDYT?



[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-02-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814822#comment-17814822
 ] 

Tim Allison commented on TIKA-4181:
---

The response from the parse should be a PipesResult _and_ it should emit the 
contents of the parse to the specified emitter, no?





[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805762#comment-17805762
 ] 

Nicholas DiPiazza commented on TIKA-4181:
-

Could Tika pipes get a full-fledged service, say tika-server-http2, to 
accompany tika-server and maybe one day replace it?

Not sure of the best way to handle packaging the app, but we could create a 
secondary main method for running tika-pipes as a grpc service.

Then we would create a protobuf contract for each of the new services that we 
add:
 * pipe CRUD operations - create, update, delete, read, list, etc.
 * run a pipe job - takes bidirectional streams of data: incoming = fetch 
metadata objects, outgoing = emitDocuments

We would then provide a Go example and a Java example, generated from our 
protobuf schema, that people could take and use.
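
As a rough illustration of the shape of those two services, here is a plain 
Java sketch with no grpc dependency. All names (PipeServiceSketch, runJob, the 
"emitted:" reply format) are made up for the example; the real contract would 
come from the protobuf schema, and the callback pair only imitates a 
bidirectional stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical sketch of the two proposed services: pipe CRUD and a streaming job.
final class PipeServiceSketch {
    // Pipe name -> opaque config string; sorted so list() is deterministic.
    private final Map<String, String> pipes = new TreeMap<>();

    boolean create(String name, String config) {
        return pipes.putIfAbsent(name, config) == null; // false if the name is taken
    }

    boolean update(String name, String config) {
        return pipes.replace(name, config) != null; // false if the pipe does not exist
    }

    Optional<String> read(String name) {
        return Optional.ofNullable(pipes.get(name));
    }

    boolean delete(String name) {
        return pipes.remove(name) != null;
    }

    List<String> list() {
        return new ArrayList<>(pipes.keySet());
    }

    // Imitates a bidirectional stream: the caller pushes fetch keys into the
    // returned consumer and receives one reply per key on the outgoing consumer.
    Consumer<String> runJob(String pipeName, Consumer<String> outgoing) {
        if (!pipes.containsKey(pipeName)) {
            throw new IllegalArgumentException("unknown pipe: " + pipeName);
        }
        return fetchKey -> outgoing.accept("emitted:" + pipeName + ":" + fetchKey);
    }

    public static void main(String[] args) {
        PipeServiceSketch service = new PipeServiceSketch();
        service.create("fs-to-solr", "{}");
        List<String> replies = new ArrayList<>();
        Consumer<String> incoming = service.runJob("fs-to-solr", replies::add);
        incoming.accept("a.pdf");
        incoming.accept("b.docx");
        System.out.println(service.list() + " " + replies);
    }
}
```

Because the job keeps accepting keys for as long as the caller holds the 
incoming side open, this is what lets tika-pipes run as a daemon rather than a 
single batch.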

 

 



[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805652#comment-17805652
 ] 

Tim Allison commented on TIKA-4181:
---

Sorry, I shouldn't have responded on the dev list. We should discuss this here.

To confirm, this would be an alternative to tika-server?

The response for /pipes would be the pipes/parse status -- did the 
fetch->parse->emit work? The response for /async would be "got the request" or 
"too many requests".
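
For the /async side, a minimal sketch of what I mean, assuming a bounded 
in-memory queue in front of the workers. AsyncIntakeSketch, Reply and the 
capacity are hypothetical names for illustration, not existing tika-server 
code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: an async endpoint that only acknowledges or rejects a request.
final class AsyncIntakeSketch {
    enum Reply { ACCEPTED, TOO_MANY_REQUESTS }

    private final BlockingQueue<String> queue;

    AsyncIntakeSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // offer() never blocks: it returns false immediately when the queue is full.
    Reply submit(String fetchKey) {
        return queue.offer(fetchKey) ? Reply.ACCEPTED : Reply.TOO_MANY_REQUESTS;
    }

    // A worker would call this to pick up the next request; null if none is waiting.
    String takeNext() {
        return queue.poll();
    }

    public static void main(String[] args) {
        AsyncIntakeSketch intake = new AsyncIntakeSketch(2);
        System.out.println(intake.submit("a.pdf"));
        System.out.println(intake.submit("b.pdf"));
        System.out.println(intake.submit("c.pdf"));
    }
}
```

The caller learns nothing about the parse itself from this reply, which is the 
difference from /pipes.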

I have a bunch of time on my hands :P and would be happy to chat via google 
meet or similar if that would be more efficient.
